Introduction: a simple explanation

In its simplest form, text analysis is just counting words. The interesting part is how you choose what to count and what you do with it.

Text analysis is often used synonymously with text mining and text analytics, though many computer scientists argue about subtle differences. It is a popular subject now and, in its various forms, is used in many organisations. When we talk about text analysis, however, we are usually discussing a collection of methods that count words and analyse their relationships with each other. These methods can be grouped under two basic approaches:

Specifying individual lexical items (words, phrases) and observing patterns. This is often manual, since you know, to an extent, what you’re looking for.

Enabling a computer to find the relationships for you. To varying degrees, this can be automatic, with minimal human input.

There are obviously pros and cons to both approaches but bear in mind that computers are very good at finding things that we as human readers often don’t notice. There’s an old saying in computing circles that if you didn’t need the computer to find it, it probably isn’t a very meaningful discovery.

What does Text Analysis involve?

Text analysis is really the process of distilling information and meaning from text. This could mean analysing reviews written by customers on a retailer’s website, or analysing documentation to understand its purpose. The aim is to examine the texts and find themes and trends that enable the business to take strategic action.

Text analytics can be done manually with one person and an Excel spreadsheet, but at scale this is time-consuming, inefficient and inaccurate. So organisations often use software that leverages machine learning and natural language processing (NLP) algorithms to find meaning in enormous amounts of text.

What are the sources for text analysis?

There are a number of places, both privately within the organisation and publicly outside it, that generate text that can be analysed. Although most organisations see or even collect data in a structured or semi-structured way, there is also a huge amount that is unstructured, making it both more difficult to analyse and more valuable.

Internal sources include customer service call logs, letters, emails, and responses to structured feedback such as post-purchase email requests. They can even include legal data such as compliance forms, customer contract agreements or customer location data, as well as verbatim data from market research (usually done via an agency) and customer surveys like NPS. Many of these are structured or semi-structured, but even then they will be held in different databases or systems which frequently cannot interact.

External sources may include data held by channels or independent platforms. This can include sales data but also unsolicited feedback on review websites, app stores, even Facebook and Google Play.

Not only is much of this unstructured, it is also unsolicited and may be left under a pseudonym or internet handle. This means that not only may there be no categorization, but the language may not be uniform; it may be in a local dialect or use slang and emotional language. Being unstructured, it can vary from one-word responses to whole technical essays.

Why is unstructured data important?

According to a study by the International Data Group (IDG), unstructured data is growing at an alarming rate of 62% per year. The same study also suggests that by 2022, close to 93% of all data in the digital world will be unstructured. Unstructured information is typically text-heavy but may contain dates, numbers, and facts as well. It includes scanned paperwork, emails, contracts, images, text written by customers, market research and much more. It is vital to understand because most organizations are overflowing with it! However, because it is often written by humans, it contains irregularities and ambiguities that make it difficult to understand with traditional programs, compared to data stored in fielded form in databases or annotated/tagged in documents.


According to IBM, within the vast bulk of data that needs to be understood, text is one of the most common types of unstructured data. But by its nature, text is messy. Therefore, analyzing, understanding, organizing, and sorting through text data can be difficult and time-consuming, which means that most companies fail to extract value from it.

How are Text Analytics used by companies?

For many companies, text analysis is the first step in a data-driven approach to management. By converting text sources to data, opportunities open up to embed the results of the analysis in processes like strategic decision making, product development, marketing, competitor intelligence and more. In a business context, analysing texts to capture data from them supports a number of broader tasks:


Social Media Monitoring

According to Hootsuite, nearly half of Americans have interacted with companies or institutions on at least one of their social media networks. On Twitter alone, users send 500 million tweets each day. This immediately tells us there is a huge volume of interaction data awaiting us! What are people complaining about? What are they praising? How are they interacting with marketing messages or campaigns? The vast explosion in unstructured social media data has been at least partially responsible for the interest in text analysis and is still its major application. Being able to monitor social media underpins a large number of the other use cases below. To understand the difference between social media listening and natural language processing, read this article.

Customer feedback monitoring

One popular method to measure customer satisfaction is the Net Promoter Score (NPS). Typically, NPS surveys include one simple yet powerful question: how likely are you to recommend our product? This simplicity has made NPS surveys one of the most popular ways for companies to understand how customers perceive their product or service. It’s a quick, easy-to-understand snapshot of where you are.
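The arithmetic behind the score is simple: respondents scoring 9–10 are promoters, 0–6 are detractors, and NPS is the percentage of promoters minus the percentage of detractors. Here is a minimal sketch in Python (the function name and sample scores are purely illustrative):

```python
def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores))

# Six survey responses: two promoters, two passives, two detractors
print(nps([10, 9, 8, 7, 6, 3]))  # 0
```

Note that the score ignores the open-text comments that usually accompany the rating, which is exactly the gap text analysis fills.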

Most organisations will also have structured methods for collecting feedback, such as post-purchase emails or follow-up phone calls. These channels are relatively structured but can obviously generate a huge amount of text.

You also need to take into account the responses to open-ended questions as well as unsolicited feedback – which is where the biggest chunk of insights can lie. Review sites like Trustpilot or retailers like Amazon are driven by feedback but there are also social media, web forums, independent surveys, live chats, and more. Suddenly, this is an enormous quantity of feedback data, all of which lie outside traditional NPS methodologies. Text analysis is the first step in understanding what customers really think.

Brand monitoring


Analysis of brand mentions allows you to keep up to date with word-of-mouth credibility within the industry, identify potential reputational risks and respond to them quickly. This can be tracked over time to understand how your brand perception has changed in the context of competitor news or social changes. According to MineWhat, 81% of buyers conduct online research before making a purchase. Consumers care about what people are saying online about a brand; BrightLocal states that 78% of consumers trust online reviews as much as personal recommendations. Part of this happens on social media, but specialist internet forums, Google Maps comments and even local news also have to be considered.

Knowing social opinion about your brand is vital, but it’s also important to know who is talking about you and in what context. Advanced forms of analysis such as sentiment analysis can understand the tone of the language used, which can also help define which influencers mention your brand and why. Sentiment analysis software can do this in real time and across multiple channels; we discuss this further below.

Competitor and market research

With text analysis, you can easily track and research not only how customers feel about you, but also what they feel about your competitors, or even what your competitors say about you. What do customers value about other industry players? Why would they choose a competitor over you? What do competitor offerings lack? Which channels do clients use to engage with competitors? Any of this knowledge can be used to improve your communication and marketing strategies, develop your customer service operations, and even help you become a more customer-centric company.

Marketers can also track consumer behavior in real-time to assess and understand future trends and help management teams make informed long term decisions. Best of all, the data is already there- it doesn’t require the long process of commissioning new market research, and can be understood in real-time.

Customer service prioritization

Industries that rely heavily on incoming customer contact, like retailers, financial institutions or transport firms, use text analysis to optimize their customer service work. With text analysis platforms, the business can automate the classification of inbound messages by polarity, topic, subject matter, and priority. Queries can then be easily escalated or sent to an appropriate specialist. For example, new messages from the angriest customers may need to be processed first, while questions about hardware faults may be sent to a specialised team.

Product development

All successful companies want to understand how product launches are performing; they aim to collect early feedback so they can start optimising products straight after release. To understand how customer feedback can save a product launch, read this case study. Customer feedback data can come from surveys, but also from social media, internet forums, or inbound support calls.

Text analysis allows the business to sort comments by topic or sentiment, to establish which product features are most or least critical or even how product messaging and packaging work. With this information, not only can product launches be optimised, but the business can confidently start the process of product optimisation.

Workforce analytics

Text analysis can also be applied internally to HR-related processes. Most large companies measure employee satisfaction and try to isolate factors that may reduce company performance or morale. Employee satisfaction surveys, as well as employee review data, can be analyzed to address problems and potential concerns faster.

Prediction

Almost any kind of prediction exercise requires a large amount of base data to analyse and then test forecasts against- text analytics can be the basis for this. For example, you might want to forecast US economic performance. Text analysis would allow you to scan SEC filings for key text, cluster related terms and determine which are causal. The same could be applied to news feeds, central bank data, airline flights, gasoline prices, avocado purchases… in fact anything! At a more fine-tuned level, the same principle can be applied to assessing the brand impact of product launches or even forecasting the effect of competitor activity.


How does text analysis work?

There are a number of different methodologies that depend on the level of depth the user may need as well as the source data type. The methodologies are interlinked in many ways and most software vendors will use a cocktail of tools.

The actual task is usually automatic or semi-automatic analysis which aims to find patterns, unusual records or dependencies. These patterns can then be seen as a kind of summary of the input data and may be used in further analysis or, for example, in machine learning and predictive analytics.

We’ll go through some methodologies but this list is by no means exhaustive.

Word Search

This is by far the simplest form of text analysis and is the one used by most companies when they begin to explore text analysis.

Put simply, if a word appears in the text, we can assume that this piece of text is ‘about’ that particular word. For example, if words like ‘price’ or ‘cost’ are mentioned in a review, this means that the customer is concerned with price. The same kind of analysis can be done using any attributes important to your product, for example speed or weight. You can then expand the list using a thesaurus and use the search function in whatever program holds your data (even Microsoft Word) to find the relevant reviews.

Let’s take the following three reviews you might find in feedback or reviews for a mobile phone purchase:

“Glad I got the 64GB version. Easy to use and worth the price”

“I find my data bills are expensive all the time”

“Much faster and cheaper than T-Mobile”

All three are connected to price. However, the second is actually about billing, and the third is a relative statement comparing performance to a competitor. So a simple search on ‘price’ misses the actual message in two out of three! Simple searches like this may work over a small number of feedback items, but at scale they will miss key messages.
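To make this concrete, here is a naive keyword search over the three reviews above, using a small hand-picked term list (the list and function name are illustrative). Notice it also misses the inflected form “cheaper” in the third review:

```python
# Hypothetical price-related term list; a real one might come from a thesaurus.
PRICE_TERMS = {"price", "cost", "expensive", "cheap"}

def mentions_price(review):
    words = {w.strip('.,!?"').lower() for w in review.split()}
    return bool(words & PRICE_TERMS)

reviews = [
    "Glad I got the 64GB version. Easy to use and worth the price",
    "I find my data bills are expensive all the time",
    "Much faster and cheaper than T-Mobile",
]
print([mentions_price(r) for r in reviews])  # [True, True, False]
```

And even when a match succeeds, as the second review shows, matching a word tells you nothing about what the text actually says about it.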

Nevertheless, you may not need further sophistication. In that case, there are some smart applications of word searches, such as Google’s Ngram Viewer or alternatives like Overview or Voyant.

Word search on Google’s Ngram, with the word searches for Frankenstein, Albert Einstein, and Sherlock Holmes.

Manual Rules

The manual rules approach to text analysis is closely related to the word search above. Both approaches operate on the same principle of finding a match, but with manual rules you can allow for more complex situations.

For example, if you want to search through reviews that may be concerned with your returns policy, you might create rules based on words beginning with ‘return’ or ‘fault’ in relation to the product. Rules can also examine word order and the grammatical relationships between keywords. Since each rule has to be created manually, the setup process is lengthy, but the results are highly precise and tailored to your needs: you get exactly what you ask for.
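A single rule of this kind can be sketched as a regular expression over word stems (the pattern below is illustrative, not a production rule):

```python
import re

# Match words beginning with "return", "refund" or "fault", as a crude
# stand-in for a hand-written "returns policy" rule.
RETURNS_RULE = re.compile(r"\b(?:return|refund|fault)\w*\b", re.IGNORECASE)

def matches_returns(text):
    return bool(RETURNS_RULE.search(text))

print(matches_returns("I had to return the charger, it was faulty"))  # True
print(matches_returns("Great value and fast delivery"))               # False
```

Real rule engines combine many such patterns with conditions on word order and grammar.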

Importantly, Manual Rules can be easily understood by a human.

This means that over time, they can be adjusted as more is learned about their use. Rules may also be the same across products and companies in the same sector so many vendors already organise the rules in taxonomies which they can drag and drop into different situations. For example, if you are compiling data from an employee survey, there would be a category named ‘Reward’ which would include subcategories such as ‘salary’ or ‘bonus’ or ‘vacation’. Similarly, categories may be preset for the insurance industry, for app store feedback or gym memberships and so on.

The approach has many benefits, but there are plenty of disadvantages too.

Firstly, multiple word meanings make it hard to create rules. The most common reason why rules fail stems from polysemy, when the same word can have different meanings. For example, take a typical piece of feedback such as:

“We came here for lunch at 1 pm but even though the queue was huge we didn’t wait too long and the server found us a table promptly.”

This comment is not a complaint about how busy the restaurant is, but a useful positive comment about the staff. Capturing the sentiment of this statement with manually pre-set rules is impossible. Similarly, a person could write:

“I did not think this product was too expensive.”

But to understand it through rules you would need a complex algorithm to spot and understand negation and its scope. English isn’t the only culprit here; “C’est pas mal” means “It’s not bad” in French which is a widely used idiom for “It’s good”. However “C’est pas terrible” means “It’s awful”.

Secondly, for many businesses, there is no preset taxonomy. This is a particular issue in the software industry, where each product is unique (and in fact may even be a service!) and customer feedback from specialist users tends to home in on specific technical issues.

Lastly, even if you have a working rule-based taxonomy, someone needs to constantly maintain the rules to make sure all the data is categorized accurately. This person would need to be a linguist, constantly scanning for the new expressions that people create. Long term, this can be an expensive proposition. You may also need to consider that there are surprising upper limits to the accuracy of any human-based system.

Text Categorization

This approach is powered by machine learning. The basic idea is that a machine learning algorithm analyzes previously categorized examples and figures out the rules for categorizing new ones. You simply need to provide examples, so no manual creation of patterns or rules is needed. This is a semi-supervised approach: the algorithm trains itself on the examples but still needs to be fed and corrected.

One advantage of text categorization is that it can capture the relative importance of a word in a text. Look at our earlier feedback example:

“We came here for lunch at 1pm but even though the queue was huge we didn’t wait too long and the server found us a table promptly”

This is not about lunch quality or queue length; it’s about the staff service. The mentioned theme “lunch” relates to “food”, which is of course important for a restaurant, but here it isn’t important at all. Instead, “server”, “found” and “promptly” are. A text categorization approach can capture this with the right training.

The downside of this method is its dependence on the quality of the training data. Research shows that text categorization can achieve very high accuracy, and new deep learning algorithms are more powerful than older, simpler ones. But researchers agree that the algorithm is less important than the training data.

What’s more, the more categories you have, the more training data is needed for the algorithm to function correctly. This method works and can be cost-effective over time; one major downside is that many organisations cannot wait.
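To illustrate the learn-from-examples idea, here is a toy multinomial naive Bayes classifier in pure Python, trained on four labelled restaurant comments (the class name and training texts are invented for illustration; real systems use far more data and more powerful models):

```python
import math
from collections import Counter

# Minimal multinomial naive Bayes over bags of words: a sketch of the
# "learn categories from labelled examples" idea, not a production classifier.
class TinyNB:
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.class_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, doc):
        def log_prob(c):
            total = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c])
            for w in doc.lower().split():
                # Laplace smoothing so unseen words don't zero out a class
                score += math.log((self.word_counts[c][w] + 1) / (total + len(self.vocab)))
            return score
        return max(self.classes, key=log_prob)

clf = TinyNB().fit(
    ["the server was prompt and friendly", "staff found us a table quickly",
     "the soup was cold", "lunch was bland and overcooked"],
    ["service", "service", "food", "food"],
)
print(clf.predict("the server found us a table"))  # service
```

With only four examples it already routes a staff-related comment to the “service” category because of words like “server” and “found”, not because of any hand-written rule.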

Topic Modelling

Topic modelling is also a Machine Learning approach, but an unsupervised one. This means that this approach learns from raw text. Data scientists who use this term are usually referring to a particular algorithm called LDA (Latent Dirichlet Allocation). This is a mathematical model of language that captures topics (lists of similar words) and how they span across various texts.

As an example, let’s take an insurance company’s reviews on an independent review website. Each review is assigned a number of topics, and the topics can also be weighted. For example, a customer comment like “telephone support is terrible I had to write four emails” could have weights and topics as follows:

50% support, service, staff, people

40% bad, poor, terrible, weak

10% number, phone, email, call

By scanning and understanding the importance of the words in the text, the machine can evaluate what is in the review and what the reviewer thinks, expressed as weighted, linked topics. Topic modelling works with no input other than raw customer feedback: the learning happens by observing which words appear alongside other words in reviews, and this information is captured using probability statistics. It is a deeply mathematical process, as the Wikipedia article on LDA shows.
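The raw signal behind this is word co-occurrence. The sketch below simply counts which word pairs appear in the same review (the sample reviews are invented); an algorithm like LDA then fits a probabilistic topic model on top of exactly this kind of evidence:

```python
from collections import Counter
from itertools import combinations

reviews = [
    "telephone support is terrible",
    "phone support staff terrible",
    "email support is poor",
]

# Count how often each pair of words appears in the same review.
cooc = Counter()
for review in reviews:
    for pair in combinations(sorted(set(review.split())), 2):
        cooc[pair] += 1

print(cooc[("support", "terrible")])  # 2 -- these words travel together
```

Words that repeatedly co-occur end up in the same topic; words that never do end up in different topics.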

When used for feedback analysis, topic modelling has a number of disadvantages. Firstly, by its nature it is not good with the very short texts common in customer feedback or social media. To read more on how enterprises should analyse customer feedback at scale, follow this link. If you have really short documents, like tweets, it is hard to break them into topics, which is an area of considerable discussion in data science circles. Secondly, it still does not understand sentence structure very well, which can make it weak for long argumentative texts like employee feedback.

Sentiment analysis

Sentiment analysis is another step up in text analysis complexity as it allows you to automatically identify the emotional tone in a text.

Thanks to natural language processing (NLP), it is possible to create systems that understand the opinions in conversations and obtain insights about products or services. Sentiment analysis is hence an automated process that uses AI to identify positive, negative and neutral opinions in text. Usually, besides identifying the opinion, these systems extract attributes of the expression, such as its polarity, the subject being discussed, and the opinion holder.

Sentiment analysis can be applied at different levels of scope from document and sentence level down to sub-sentence level sentiment analysis which obtains the sentiment of sub-expressions and clauses within a sentence.

There are many types of sentiment analysis.

Fine-grained sentiment analysis: this allows refinement beyond positive/neutral/negative. It can be extended into as many categories as needed, with added refinement such as the type of polarity, for example anger or enthusiasm.

Emotion detection: this aims at detecting emotions like happiness, anger and so on. Many emotion detection systems use pre-packaged lexicons (i.e. lists of words and the emotions they convey), or they can use complex machine learning algorithms.

Aspect-based sentiment analysis: for example, a comment at the end of an otherwise positive review like “…but the processor could be more powerful” is complex to understand. It is a negative opinion about one aspect of the object and uses the auxiliary verb “could” to express a desire for more power. Aspect-based analysis homes in on this kind of expression about an aspect of the product.

Intent detection: this looks for the action behind the sentiment. For example, take this tweet: “I’ve been on hold for 20 minutes!” Intent detection can identify that an action is required and effectively head off an angry customer.
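A minimal flavour of the lexicon-based end of this spectrum can be sketched in a few lines. The word scores and the crude negation flip below are invented for illustration; note how it handles a “not … expensive” negation that would defeat simple word search:

```python
# Illustrative word-score lexicon; real lexicons contain thousands of entries.
LEXICON = {"good": 1, "great": 1, "bad": -1, "terrible": -1, "expensive": -1}
NEGATORS = {"not", "never"}

def polarity(text):
    score, negate = 0, False
    for word in text.lower().split():
        if word in NEGATORS:
            negate = True          # flip the next sentiment-bearing word
        elif word in LEXICON:
            score += -LEXICON[word] if negate else LEXICON[word]
            negate = False
    return score

print(polarity("the support was terrible"))        # -1
print(polarity("this product was not expensive"))  # 1
```

Real sentiment systems go far beyond this, modelling negation scope, intensifiers, idioms and sentence structure.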


Text Analysis Methodology

As we’ve seen above, text analysis can involve a number of techniques, ranging from drag-and-drop in Excel to NLP. Whatever level of complexity you choose, its application and use involve some practical steps which need to be considered:

Information Extraction

This is the first step in the process of evaluating unstructured data. It involves finding the data and cleaning it of unnecessary punctuation, typos and so on, followed by tokenization and identification of named entities (a named entity is a real-world object, such as a person, location, organization or product), key phrases and parts of speech. Pattern matching is then used to find any predefined sequences in the data, and from there you can identify the relationships between entities and attributes. The endpoint is a database that is at least semi-structured.
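As a toy illustration of the tokenization and pattern-matching steps, the sketch below splits a sentence into tokens, pulls out capitalised spans as candidate named entities, and finds dates via a fixed pattern (the sample text and patterns are invented; real pipelines use trained NER models):

```python
import re

text = "Acme Corp emailed Jane Smith on 12/03/2021 about the refund."

# Tokenize into words and punctuation.
tokens = re.findall(r"\w+|[^\w\s]", text)

# Crude candidate named entities: runs of capitalised words.
entities = re.findall(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+", text)

# Predefined sequence: dd/mm/yyyy dates.
dates = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", text)

print(entities)  # ['Acme Corp', 'Jane Smith']
print(dates)     # ['12/03/2021']
```

The extracted entities, dates and relationships are what populate the semi-structured database that the later steps work on.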

Categorization

Next, we develop an input-output principle: the system is given the pre-defined categories under which the data in new documents is to be classified. This is a statistical step and uses techniques such as nearest-neighbour classifiers, decision trees, naïve Bayes classifiers, and other statistical classification methods. We end up with categories assigned automatically to the text.

Clustering

This is again a sophisticated step, in which we bring together documents with similar content into groups known as clusters. The content of documents within a specific cluster is similar, while that of documents in different clusters should, by definition, be quite different. This technique works on semantics, the principle on which semantic search engines work.
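The grouping principle can be sketched with plain word-count vectors and cosine similarity (the documents and the 0.4 threshold are invented; production systems typically use TF-IDF vectors and algorithms like k-means):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

docs = [
    "refund delayed refund missing",
    "my refund is missing",
    "great delivery speed",
]
vectors = [Counter(d.split()) for d in docs]

# Greedy clustering: join a document to the first cluster it resembles.
clusters = []
for i, v in enumerate(vectors):
    for cluster in clusters:
        if cosine(v, vectors[cluster[0]]) > 0.4:
            cluster.append(i)
            break
    else:
        clusters.append([i])

print(clusters)  # [[0, 1], [2]] -- the two refund complaints group together
```

Documents sharing vocabulary land in the same cluster with no labels or rules supplied in advance.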

Visualization

Most of the complex statistical and AI work has been done by this stage. Here we use visual cues such as text flags to indicate documents or document categories, and develop other ways to identify and indicate the densities of a category, entity or phrase. The user needs an overview as well as the ability to zoom in without losing data. If the original source was large, some kind of visual hierarchy will also need to be developed.

At this stage, we begin to integrate with existing business processes to understand how the knowledge developed will be used, and by whom.

Summarization

The objective here is to generate a summarized version highlighting the information that will be most relevant to the user. This in itself requires the application of algorithms to summarize text automatically with a semantic engine to ensure the meaning isn’t lost.

Conclusion: Text analysis

As we can see, text analysis is a complex and potentially powerful tool, and its sophistication can produce cutting-edge new analysis.

It can offer incredible insights into social media sentiment, workforce performance, competitor intelligence and more. This is at least partly because social media and the public internet have created a huge body of text that can be automatically analysed and understood.

The benefits of effective text analysis are so great that if you’re involved in an organisation that assembles large volumes of data, or has an interest in understanding information, social media or feedback sources, you likely have the basis for building a business case for text analysis.
