Semantic Keyword Research with KNIME and Social Media Data Mining – #BrightonSEO 2015

I had the opportunity to travel to the UK in April and speak at BrightonSEO, an SEO conference I’ve always admired from afar in the United States.

Needless to say, it was an incredible experience. I returned from England having connected with many of my European SEO brethren, with a newfound liking of beans for breakfast, and with the word “garbage” stricken from my vocabulary and replaced with “rubbish”.

For those of you who did not make it out to see my presentation, or for those who attended and yearned for greater detail, I present to you this recap of my presentation…

Why Do Semantic Keyword Research?

Search Engines such as Google and Bing are coming to rely more heavily on semantic search technology to better understand the websites in their indices and what people mean when they search.

The prevalence of these technologies means that it is time for SEOs to adapt once again and better understand language usage and how keywords relate to each other conceptually.

The strength of the keywords’ conceptual connection may be scored for relevancy on-page, within a search query, or in a combination of the two.

Let’s do semantic keyword research!

What is Semantic Search? The Transition from Strings to Things

Note: Some of these concepts are simplified to ease reader understanding.

At a very high level, the idea of semantic search is to look past a piece of text on the web as a mere series of keywords and instead understand the meaning of those keywords and how they relate to each other.

This helps with relevancy scoring and with better understanding and interpreting the meaning of text. This might affect how the search engines come to understand the intent of a search query or determine whether a website is a good match for that search.

There are two ways to look at data for semantic search:

Through the lens of structured data or unstructured data.

Structured data means reading user-provided markup to understand how concepts and entities might relate to each other. Schema.org markup on a page is a source of structured data on a web page that the search engines can use to understand it semantically.

Search engines may also examine web text semantically without the presence of structured data, using technologies such as Natural Language Processing and Machine Learning algorithms. Search Engines may also use data provided by web pages marked up with structured data to be able to understand unstructured data better.

In our case, we are not concerned as much with structured data and are focusing on semantic search in the context of unstructured data.

In my presentation, I gave the following example search:

“What is a mammal that has a vertebrate and lives in water?”

The search engines may break out the search this way:

And then interpret it as:

You can try this example in Google Search. In most cases, Google produces information about whales and related animals.

About Google Hummingbird

It is difficult to discuss the prevalence of Semantic Search without at least mentioning Google Hummingbird.

Hummingbird is, at its core, more of an infrastructure update, but it does have some new technology baked into the new engine.

Amit Singhal, the head of Google’s core ranking team, discussed some of Hummingbird’s conversational search capabilities with Danny Sullivan:

“Hummingbird is paying more attention to each word in a query, ensuring that the whole query — the whole sentence or conversation or meaning — is taken into account, rather than particular words. The goal is that pages matching the meaning do better, rather than pages matching just a few words.”

It is clear that Google has incorporated more semantic search technology with the introduction of Hummingbird.

He boils down Google’s potential new capabilities post-Hummingbird, saying it should be able:

To better understand the intent of a query;

To broaden the pool of web documents that may answer that query;

To simplify how it delivers information, because if query A, query B, and query C substantively mean the same thing, Google doesn’t need to propose three different SERPs, but just one;

To offer a better search experience, because expanding the query and better understanding the relationships between search entities (also based on direct/indirect personalization elements), Google can now offer results that have a higher probability of satisfying the needs of the user.

As a consequence, Google may present better SERPs also in terms of better ads, because in 99% of the cases, verbose queries were not presenting ads in their SERPs before Hummingbird.

How Can SEOs Optimize for Semantic Search?

At a high level, you want to make sure that you are creating high-quality content that delights your users, paying close attention to searcher intent. Mapping content to personas and categorizing keywords as navigational, transactional, or informational may also help with this endeavor.

“Now this is great content”

After that, keywords on your website should be semantically related, though not necessarily on the page level. To help with this, start thinking about your website in terms of related topical buckets. One of your goals should be to have the search engines perceive your site as an authority for each one of those topical buckets.

Your website should be able to be broken down into one or more broad interrelated topics, likely representable by short tail keywords. Each one of these topics can be thought of like a bucket.

Within each of those buckets reside sub-concepts or keywords, often long-tail keywords, but not necessarily (represented as the red balls above). They should relate to the other topic buckets on your website, but even more so to the bucket they are contained inside.

Creating quality content that represents those sub-concepts and earns links helps build the topical authority of its bucket and of the website overall.

When creating your buckets, it is helpful to have an exceptional understanding of consumer language and the myriad ways that users may search in relation to your website’s topic.

At a bare minimum, you need to understand the following language search perspectives:

What are consumers searching for when they are familiar with your topic?

Language used should represent your core keywords.

What are consumers searching for when they are not familiar with your topic?

Language tends to be more conversational. You may uncover additional related terms when exploring your topic from this perspective.

What else do these two groups search for typically?

These searches may be directly and/or indirectly related to your topic.

Looking at your topic like this will help form a foundation of keywords that we will use in our topic buckets and expand upon in our semantic keyword research. Later on, we’ll examine the semantic relationship between these keywords using data visualization to simplify the selection process.

Why Social Media Data is an Awesome Data Source for Semantic Keyword Research

When conducting keyword research, it is intuitive to factor in SERP data, but an incredible secondary data source is social media.

Reasons to use Social Media for Keyword Research

Social data helps you expand your collection of keyword ideas, especially when it comes to newer, fresher keywords.

Social Networking language is inherently conversational and can help you understand the phrasing of conversational queries.

We can use Social language to mimic the language of the user, which has a secondary CRO benefit.

Note: I typically focus on Twitter for this data since it has an existing infrastructure for data mining and is the easiest of all the social networks to work with.

Secondary Benefit, CRO: The Echo Effect

This is a bit of a tangent, but it is worth mentioning. While you are already doing social data mining, you might as well use this information to better your copywriting. Several academic studies indicate that mimicking the language of the consumer (which we will derive from Twitter text) helps build trust and improve conversions1:

A study published in the International Journal of Hospitality Management demonstrated that waitresses who copied the language of a person’s order word-for-word were given higher tips on average.

Another study, published in the Journal of Language and Social Psychology discusses how mimicking peoples’ language can help with building likability, safety, and rapport–all aspects of effective copywriting.

Moving on…

Let’s say we’ve collected massive amounts of data. Some of that data will come from websites ranking in the SERPs for relevant keywords and some from social networks like Twitter.

What kind of simple analyses can we do to help with our semantic keyword research?

You can very easily examine that data through the lens of:

Term frequency–inverse document frequency (TF-IDF): a weighting of how important a word is to a document, discounting words that are common across the whole corpus

Co-occurrence: how often two or more words appear alongside each other in a corpus of documents (in our case, websites and Tweets)

Topic modeling with LDA: grouping the keywords in the corpus into related topical clusters
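To make co-occurrence concrete, here is a minimal pure-Python sketch of document-level co-occurrence counting, done outside of KNIME; the sample documents are invented for illustration:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how often each pair of distinct words shares a document."""
    counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))  # dedupe; sorting fixes pair order
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts

docs = [
    "zombie movie night",
    "zombie movie remake",
    "night of the living dead movie",
]
counts = cooccurrence_counts(docs)
print(counts[("movie", "zombie")])  # 2: the pair appears together in two documents
```

KNIME’s built-in nodes do the same kind of counting for you, at configurable levels such as sentence or document.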

Quality Visualization Will Help Make Use of the Data

Semantic relations can be difficult to incorporate into your typical Excel-based keyword research document, so there are some data visualizations that KNIME can produce that we can use to easily process this information.

The most useful have been a simple color-coded word cloud (depicted left) and a node graph visualization (depicted right). I’ll come back to these later.

The Basics of KNIME

Let’s start with “nodes”, the building blocks of a KNIME project.

What is a Node?

Nodes are pre-built drag-and-drop boxes designed to do a single task. There are a HUGE number of pre-built nodes in KNIME that are useful for marketing and beyond.

KNIME nodes are combined together into “workflows” to accomplish larger, more complex tasks.

Nodes can be grouped together into meta nodes that can be configured in unison.

That’s right, there are even pre-built Google Analytics nodes. KNIME can be used for bioinformatics AND marketing.

How do you add Nodes?

The KNIME interface is somewhat customizable, but typically you can find your list of nodes on the left-hand panel within the “Node Repository”.

If you installed the correct version, you should have access to hundreds already.

To use a node, it’s as simple as finding the one you want and click-and-dragging it into a workflow tab.

Demonstration: How to click-and-drag a node from the “node repository” into your workflow tab.

How do you connect nodes to one another?

To connect nodes to one another, it is also a click-and-drag action.

Nodes have input and output “ports” that look like a little white triangle on the left and right sides. You click-and-drag from an output port to an input port.

Demonstration: How to connect KNIME nodes to each other.

Note: Honestly, KNIME seems intimidating at first, but it’s SUPER easy. The trickiest part is becoming familiar with which nodes are available, what they are called, and which ones can connect together. You can learn about that by reading the documentation for each node in the “Node Description” area.

Configuring the Nodes in your Workflow

Once you’ve added and connected nodes in your workflow, depending on the node, it may be necessary to change their settings.

To change a node’s settings, you simply right-click it and choose “Configure”.

A settings dialog will pop-up. Each node will have a different setting interface, but most of them are self-explanatory.

The example above is the “Table Creator” node (very useful for some quick text entry); its settings dialog looks like a basic Microsoft Excel spreadsheet, and it functions about the same.

How to Run Your Workflow

There are a few ways you can run your KNIME workflow. You can right-click a node and choose “Execute” to run it individually. (If you execute the last node in a linear workflow, all of the preceding nodes should run as well.)

…Or you can click the green circle with the double white arrow at the top to run all of the nodes in the workflow.

How to Extract Twitter Data with KNIME

I’ve already mentioned the merits of using Twitter as a data source for your semantic keyword research, and thankfully, this is very easy to do within KNIME. Here’s how…

SERP Data in Your KNIME Workflow

The next source of text data you should be using is, very obviously, search result data.

If I do a search for a keyword and look at the pages ranking in the top 10 results, we know that Google, by its full range of ranking factors, considers those the best pages to match that query.

Using KNIME we can extract the text from those pages and use them as a seed for our keyword analysis, just like we’ve done with Twitter.

There are a number of ways to go about this…

Inputting SERP Data Manually

We can use rank checking software like AWR that outputs either a CSV or Excel file and read it with KNIME.

KNIME has both an “XLS Reader” node and a “CSV Reader” node.

Alternatively, you can do a search for your keyword, grab a list of ranking URLs using something like the SERPS Redux bookmarklet, and input them manually using the “Table Creator” node.

Inputting SERP Data with an API

A better way of inserting search result data into your KNIME workflow would be to use a rank checker with API access, like Authority Labs or getSTAT.

If you’re familiar with using APIs, then the two main nodes you will need are the “GET Resource” node and the “Read REST Representation” node (see the above slide).

Working with an API is one of the more difficult things I discussed in my presentation, so you may want to grab a developer on Elance or something to help you out with this step if you’re having difficulty.

I am personally using a weird set-up where a Python script updates a Google Spreadsheet with ranking data, and I extract that information using Google’s weird built-in SQL-like query language. I don’t recommend doing this 😉

Extract Plain Text from Websites in KNIME

Now you have a list of URLs from Twitter and from the SERPs. The next step is to crawl those pages and get them into a plain text format.

There are a number of ways to get a webpage into a plain text format.

KNIME even has a built-in “ContentExtractor” node that makes use of the Readability API (under the Palladian Community Nodes), but it doesn’t work that well in my experience.

I found that BoilerPipe, a Java library with a web API interface, works best:

The agency that I work for is lucky enough to have a developer and we created a native KNIME node for BoilerPipe, but unfortunately I am unable to share it.

On the bright side, the free API works quite well and can be incorporated into KNIME as well. The only limitation is that you might get some timeouts if you are hitting their server too hard and too frequently.

I’ve provided a meta node that makes use of the BoilerPipe web interface, which you can incorporate into your workflow:

The output you will get from feeding a URL into the BoilerPipe meta node will look something like this:

It effectively extracts the main content of a page and distills it into plain ASCII text.

Before we can do any text analysis in KNIME, we have to do a quick intermediate step and put everything into the correct format. Any text must be converted into the “Document” format in order to be collected within our document corpus for analysis.

To do this, we will use the “Strings to Document” node. It can be attached to any plain text data in KNIME, such as webpages we have converted to plain text using BoilerPipe or Tweet text.

From here, we can work some KNIME magic!

Useful Nodes for Text Analysis Worth Mentioning

There’s a lot of text mining and Natural Language Processing built into KNIME out of the box. I won’t have time to cover all of its capabilities, nor am I an expert.

I do, however, think it’s worth mentioning a few that come into use very frequently within the context of semantic keyword research and marketing in general.

Nodes with definitions:

The Bag of Words Creator node: a Bag of Words is “a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity”.2 It’s necessary to get text into a Bag of Words model in order to do a lot of analyses.

The Ngram Creator node: An N-gram is a “contiguous sequence of n items from a given sequence of text or speech”.3 If we want to examine the occurrence of various text segments or phrases, we need to look at it by multi-word segments. To do that, we examine the text by N-grams. If you’ve ever played with Google’s Ngram Viewer, you know how powerful this can be.

The POS tagger node: POS stands for Parts of Speech–“A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token) such as noun, verb, adjective, etc”.4 It’s really helpful to understand how language is being used to talk about your topic.

The OpenNLP NE tagger node: This node can be used to isolate “Named Entities” in text. I don’t go into how to use this, but the usefulness for doing semantic keyword research is apparent, since you can easily extract entities such as persons, organizations, locations, expressions of times, quantities, and monetary values.

Note: As an alternative to the OpenNLP NE tagger node, I would also consider exploring the incorporation of the AlchemyAPI AlchemyLanguage API into your workflow, for even better entity detection. It is freely available for non-commercial use with credit given. You can incorporate it pretty easily using the “REST Nodes” which can be found under the “Community Nodes” node section within KNIME.
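The first two of these, the Bag of Words and the N-gram, are simple enough to sketch in a few lines of pure Python outside of KNIME (the sample text is invented for illustration):

```python
from collections import Counter

def bag_of_words(text):
    """Bag of words: token counts, disregarding grammar and word order."""
    return Counter(text.lower().split())

def ngrams(text, n):
    """All contiguous n-word sequences in the text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(bag_of_words("the dead walk the earth")["the"])  # 2
print(ngrams("night of the living dead", 2))
# ['night of', 'of the', 'the living', 'living dead']
```

The KNIME nodes do the same thing at corpus scale, with proper tokenization instead of this naive whitespace split.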

THINGS ABOUT TO GET MUCH TECHNICAL. GLORIOUS.

Parts of Speech Tagging

As mentioned above, it is helpful to use parts of speech tagging to understand the language of your topic.

Want to understand how people talk about a product that your client sells?

Drop in the information from Twitter and ranking pages from the SERP and examine the Adjectives!
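In KNIME, the POS tagger node does the tagging for you; the sketch below only illustrates the downstream filtering step in pure Python, on tokens I have pre-tagged by hand with Penn-Treebank-style tags (adjectives carry tags starting with “JJ”):

```python
# (word, tag) pairs; in practice these come from a POS tagger such as
# KNIME's POS tagger node rather than being written out by hand.
tagged = [
    ("terrifying", "JJ"), ("zombie", "NN"), ("movie", "NN"),
    ("was", "VBD"), ("surprisingly", "RB"), ("scary", "JJ"),
]

# Keep only the adjectives.
adjectives = [word for word, tag in tagged if tag.startswith("JJ")]
print(adjectives)  # ['terrifying', 'scary']
```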

Topic Modeling in KNIME Using LDA (Latent Dirichlet Allocation)

For a good explanation of how LDA works, I recommend giving this post a read, although it isn’t necessary if you understand the goal of performing an LDA analysis:

LDA is an excellent way to start looking at a large number of keywords from various sources and understand which ones relate to each other and which ones don’t.

It fits very well into the topic bucket model I explained at the beginning of my post, with each topic modeled by LDA represented as a keyword bucket.

It is really easy to conduct an LDA analysis in KNIME. There is a built-in node called the Topic Extractor (Parallel LDA) node.

The only limitation of LDA is that you have to estimate how many topic buckets the model should identify.

I usually set it to the default of 10 topics, see how related the output looks, and then adjust accordingly.

Types of Visualizations That Are Useful for Examining TF-IDF, Co-Occurrence, and LDA

There are two main visualizations that I have found to be very useful for enhancing your boring, Excel-based keyword research template and conveying the semantic attributes of your keyword suggestions.

Word Clouds

If you need to examine data containing keywords and any sort of weight or frequency metric (such as TF-IDF or even perhaps co-occurrence), then ye olde word cloud is a very sensible visualization.

It shows the actual keywords.

A keyword’s weight can be displayed via the size of the word.

Furthermore, color can be applied to segment the keywords beyond frequency or weight. Using the “Color Manager” node and feeding it into the “Tag Cloud” node (the default word cloud visualization node built into KNIME), we can apply different colors to different keyword types.

So, if you are segmenting your keywords by parts of speech or entity type, then a color-coded word cloud makes a whole lot of sense.

Network/Node Graphs

If the goal of your visualization is to demonstrate a connection between two or more elements, then a node graph is a very logical means of doing so.

For our purposes, these connections represent semantic relationships between topics via keyword trees and clusters.

Displaying keywords according to LDA or co-occurrence represents a simple example of this visualization.

When we display LDA with a node graph, we can easily see how several keywords cluster around a certain theme.

This is also true when visualizing co-occurrence in a node graph, except we pay special attention to thick, forest-like clusters.

More densely connected keywords have a greater number of co-occurrences with other keywords and may represent a stronger connection with your theme.

How certain clusters interconnect with other clusters is also something to pay attention to when visualizing co-occurrence this way. The greater the number of interconnected clusters, the greater the relevance, much more so than at the individual keyword level.

To create a node graph, you need three built-in KNIME nodes:

The Network Creator node – You will use this to initiate node graph creation. It doesn’t do much for our purposes but create an empty node graph.

The Object Inserter node – You will feed your keyword data (LDA or co-occurrence works) and your Network Creator node into this node. Configure it to define which data represent the nodes and edges of your graph.

The Network Viewer node – Feed the Object Inserter into this node and generate the actual visualization. You can right-click and configure to choose different clustering algorithms for an optimal visualization.

Bringing it Together

So at this point, you should have a pretty good understanding of some of the things we can do in KNIME.

We can…

Search Twitter for a keyword and then collect the text of all the matching Tweets.

Search Twitter for a keyword, extract only the shared links from those Tweets, crawl those URLs and then scrape the text from them.

Extract the top 10 ranking pages for a keyword and then crawl and scrape text from those pages.

“Perform a query at Google for a term such as “mockingbird” and take the top 1,000 or so documents that appear in the search results responding to that search.”

Google uses the top 1,000 results for its analysis. We will do a simplified version, using the methodology depicted earlier to find the top 10 results for the query “mockingbird”. We’ll reduce them to plain text using the BoilerPipe set-up I’ve discussed.

“Extract most of the terms from those documents after marking where they appear on the page, and calculate scores for each of the words based upon things such as how many times they occur in a document,…”

Use the term frequency node.

“…and how close to the beginning of the document they might be.”

I didn’t touch upon this previously, but you can either choose to ignore this or you can create a system using the “String Manipulation” node and the indexOf() function built into it.
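As a rough pure-Python sketch of the kind of position weighting you could build with indexOf(): terms whose first occurrence is nearer the start of the document score higher. The formula here is my own invention for illustration, not anything Google has published:

```python
def position_score(document, term):
    """Score a term by how close its first occurrence is to the start."""
    index = document.lower().find(term.lower())  # like indexOf(): -1 if absent
    if index == -1:
        return 0.0
    return 1.0 - index / len(document)

doc = "Night of the Living Dead defined the modern zombie movie."
print(position_score(doc, "night"))   # 1.0: the very first term
print(position_score(doc, "zombie"))  # lower: first appears near the end
```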

“Perform a capitalization analysis and a part of speech analysis to determine if the terms might be nouns, proper nouns, named entities, or even nuggets of information such as sentences. These might be scored higher than verbs or other types of terms within the document. Other types of analysis might also be used to determine if a term is a named entity.”

Use the POS Tagger and the OpenNLP NE tagger (maybe also AlchemyAPI). Use some Text Processing filters to isolate the nouns and entities.

“Filter out the terms that tend to appear pretty commonly on the Web using something like a term frequency–inverse document frequency (TFIDF) score for those documents to see which terms are common. The top 20 or so terms that are above a certain threshold based upon the TFIDF analysis might be kept for a document, and the rest eliminated. These remaining terms are the most significant terms in the document.”

Calculate TF-IDF on your terms and then filter further based on a value threshold. You can use the Rule-based Row Filter or Row Filter node to accomplish this.
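For reference, the underlying TF-IDF calculation and threshold filter look something like this pure-Python sketch, using the common tf × log(N/df) formulation; the documents and the 0.3 threshold are made up for illustration:

```python
import math
from collections import Counter

def tfidf(documents):
    """TF-IDF per term per document: term frequency * log(N / document frequency)."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # each document counts once per term
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in documents]

docs = [
    ["zombie", "movie", "the", "the"],
    ["zombie", "remake", "the"],
    ["romero", "interview", "the"],
]
scores = tfidf(docs)
# Keep only terms above a threshold, as a Row Filter would:
significant = {t for t, s in scores[0].items() if s > 0.3}
print(significant)  # 'zombie' and 'movie'; 'the' scores 0 because it is everywhere
```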

“Then calculate relationships scores for the terms left over in each document. Words that interact in a document by being in close proximity to each other are said to have a relationship. A close proximity might be seen if the words appear in the same sentence, or the same paragraph, or within a certain number of sentences from each other. These are local term relationships. If one of the remaining terms has no local term relationships with any of the other terms, it is disregarded.”

Use the Term Co-occurrence Counter node to calculate co-occurrences. Use multiple instances of the node, counting co-occurrences at different levels such as the sentence, neighborhood, or document level. These can be configured within the node.

Use the row filters to filter to a certain co-occurrence threshold.
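As an illustration of what a sentence-level co-occurrence count plus a threshold filter produces, here is a pure-Python sketch (naive sentence splitting on periods; the sample text is invented):

```python
from collections import Counter
from itertools import combinations

def sentence_cooccurrences(text, threshold=2):
    """Count word pairs sharing a sentence; keep pairs at or above the threshold."""
    counts = Counter()
    for sentence in text.split("."):  # naive sentence splitter
        words = sorted(set(sentence.lower().split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= threshold}

text = (
    "Romero directed the zombie classic. "
    "The zombie genre began with Romero. "
    "The remake was filmed in 3D."
)
pairs = sentence_cooccurrences(text)
print(pairs)  # {('romero', 'the'): 2, ('romero', 'zombie'): 2, ('the', 'zombie'): 2}
```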

“A score for each of those documents can be generated by looking at which documents have terms in common, and among those documents with common terms, and something like a combination of the original ranking score and a document score based upon all of the term relationship scores within each document.”

Do some calculations to get at this; the Math Formula node will help you work with this data.

Throw the results into a node graph. The visual output of the node graph will help you actually use this data for keyword research purposes.

Awesome, right!?

Using our Output Visualizations…

Let’s move on to some of the ways that the various built-in KNIME data visualizations can help us interpret our data…

Parts of Speech Output

We’ve applied colors that represent either Parts of Speech or different entity types.

As we’re writing content, we can easily look at this graph and sprinkle in an extra adjective or commonly associated entity.

Or, if we’re fleshing out our keyword research document and we want to expand our long-tail keywords, we can look at these words and try to combine them for more ideas.

TF-IDF + Co-Occurrence Output

As previously mentioned, we are looking for two things: 1) individual keyword clusters, and 2) keyword clusters connected to other keyword clusters.

For the example above exploring “Night of the Living Dead”, we’d pay special attention to these two highlighted clusters.

For cluster one…

Doing a little bit of Googling, we find that the subject matter pertains to a horror movie convention that some of the film’s cast attended, which generated some buzz. If we were writing a website about Night of the Living Dead, we might want to write a blog post about that convention.

For cluster two…

This cluster amalgamates multiple smaller clusters and is probably the most representative of our overall subject matter.

The bottom of the cluster has something to do with George Romero creating the zombie movie genre with Night of the Living Dead (there is some associated junk about a blog post that we can ignore).

In the mid-region of the cluster, we see discussion of a specific scene from a recent remake of the film called Night of the Living Dead 3D (it wasn’t very good).

Toward the top of the cluster we have a region that is much less densely connected. Doing some research, we find that this pertains to a comic book series being created about the movie. The same company is making a Pacific Rim comic as indicated by a few tiny branches of the cluster.

These are all topics we should consider exploring for our Night of the Living Dead fan site!

TF-IDF + LDA Output

We’ve filtered to “important” keywords and performed a topical analysis of them using latent Dirichlet allocation, isolating 10 different topics.

Each identified topic is visualized in the node graph as a keyword spiral, each labeled topic_#.

For example, spiral #1 above (labeled topic_6) pertains to the various sequels to Night of the Living Dead; both Dawn of the Dead and Day of the Dead are mentioned. A page devoted to Night of the Living Dead sequels would be an excellent page for our fan site.

Spiral #2 above (labeled topic_8) pertains to how Night of the Living Dead was selected by the Library of Congress for preservation in the National Film Registry. Another excellent topic to discuss on our fan website!

Now go forth upon the world and start doing better semantic keyword research!

You should now have a basic understanding of an awesome tool and how you can use it to start doing better, more tangible semantic keyword research.

There’s a lot that I didn’t have an opportunity to cover.

So if you have any questions about what I’ve covered or didn’t, ask away in the comments below!


13 Comments

There is more than one version you can download. You want to download the version of KNIME that includes all of the extensions. Alternatively, there is a way to download individual nodes, but I recommend just grabbing the more comprehensive version out of the box.

Great stuff man. Keep it coming. A semantic enthusiast such as yourself and this article hits the nail on the head. It’s almost like I don’t want to share it with anyone because then I would slowly become irrelevant 😉

Paul, what an excellent article! I was searching for a practical approach to something like semantic keyword clouds and as far as I can tell, this is the sweet spot for it. However, after downloading & installing the complete KNIME and giving myself a try at the ‘extract and crawl twitter links’ workflow, I run into trouble with three nodes, that do not seem to be included in the node repository anymore
– HttpRetriver
– UrlExtractor
– UrlResolver
Does this also happen to you, or am I missing an additional node package I need to install, anything?
Thanks in advance.

I’m running the latest stable version (2.1.2.0) and they’re included. My guess is that you downloaded the Early Access 3.0 version? Or downloaded the version without all of the extensions? Another thing that’s possible, is that they’re included in the 32bit version and not the 64bit versions?

Alternatively, these are all under the Community Nodes->Palladian section and you should be able to download them under Help->Install New Software… menu.

Hello Paul, thanks for the reply and help provided. Indeed I had Version 3.0. I downgraded to yours to have the exact same setting. All nodes are there, but… the ‘Extract and Expand Tweet URLs’ workflow still has issues. I went through each node / process step, executed, checked the result table.
– String Manipulation: Part 1:t.co Extraction RegEx
– String Manipulation: Part 2:t.co Extraction RegEx
In these nodes I adapted the config to work on https which was http before and the result table looked fine afterwards.
– URLResolver
This simply does not seem to work anymore, the ‘Resolved URL’ column is exactly like the input column.
So unfortunately, at least so far, this is the furthest I could get. If you have any advise, highly appreciated.
Cheers, Kai

Thanks! Once you’re familiar, and you have stuff built out, the time investment is VERY minimal. Even if you don’t have everything built-out, and you’re familiar, it’s amazing what you can accomplish in a short period of time.

Great post, Paul. Thank you very much for your insight. Unfortunately, I get lots of 403s that prevent me from extracting text. I am assuming BoilerPipe does not want to process a batch the way it is setup in your example. Please let me know if you have alternatives that were not mentioned in the original post.

Yeah, an alternative would be to integrate the actual BoilerPipe library, which unfortunately isn’t easy if you’re not familiar with Java. I’ve written the code if you want to figure it out; it’s on my GitHub.

Terrific writing and really impressive way to make things clear. Topical and intent based keyword research is very important these days and LSI keywords are playing major role after the Hummingbird update. But the newest change on Google keyword planner tool is making things difficult to select best relevant keywords depending on their average monthly search volumes. Really enjoyed reading the post and couple of valuable points are also noted.