Data Validation

When your company's data are valid, everything that depends on those data works seamlessly, like clockwork.

Data validation is the process of ensuring that your data are valid and fit for purpose rather than contaminated with rogue data. Failing to validate your company's data is a risk that can backfire in a whole range of catastrophic ways.

We never cease to be surprised at the sheer quantity of errors to be found in data, at times originating from respectable multinational corporations, and caused by a whole variety of reasons. On top of this are the problems created by blending various data sources into a single repository (such as a data lake or a data warehouse), where lack of consistency compounds the problems that errors have already created.

800 pages of data on one person

We know that businesses simply cannot afford to get their data wrong, whether the data are used for legal compliance, displayed on a website, collected from Internet of Things sensors, or used for marketing and sales campaigns, internal business intelligence and reporting. When a journalist recently received 800 pages of data about her year as a moderately heavy user of a dating website, most people stayed with the thought of how much data one website had on one person. Yet as of next year, within the European Union, such data must legally be provided to anyone who requests it, and this trend is more than likely to extend worldwide.

The question we asked ourselves is: how many errors were there within those 800 pages of data? Not that we think dating sites are more likely to produce data errors than any other company. Indeed, they may be among that minority of companies which already have the processes in place to validate their data effectively and thus guarantee that they are error-free, not merely for this one customer but for the many thousands, if not millions, of customers on their books. Nor, it appears, did the journalist go through those 800 pages with a fine-tooth comb looking for errors. But if somebody were to do so, and were to find rogue data that violated one of the strict new data and data privacy laws such as GDPR, it could cause that company serious problems.

Data blending issues

Our surprise at seeing so many errors in the data we have passed through our machine learning filters is partly because we know what a devastating impact a single data error can have. If a hedge fund scrapes large quantities of data from multiple sources and then lets its sophisticated and expensive artificial intelligence programme analyse those data and make financial buying and selling decisions based on them, a single error in the data could result in the fund losing millions of dollars of its clients' money. Scraping information for later analysis is particularly vulnerable to rogue data issues because of the inconsistencies that arise when blending data from different sources into one repository. To blend these data successfully into one large whole, each and every record must be properly validated.

At Spotless Data our goal in life is simply to ensure that all the data you have are validated as they are ingested into your repository, before they enter your platforms. We do this through our machine learning filters, which filter out rogue data, either modifying or removing them, and quarantining any suspicious data. The result is data so spotlessly clean and so fit for purpose that every last piece has been properly validated. The data thus retain their integrity throughout their lifecycle, from the moment they leave us ready to be ingested into the repository of the company which owns or manages them.
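As an illustration of the modify-remove-quarantine idea, here is a minimal sketch of a validation filter. The field name and the rule are our own assumptions for the example, not Spotless Data's actual implementation:

```python
# Minimal sketch of a validation filter: each record is passed through
# clean, modified where the fix is unambiguous, or quarantined as
# suspicious for later review. Field names and rules are illustrative.

def validate_duration(records):
    clean, quarantined = [], []
    for rec in records:
        raw = str(rec.get("duration_minutes", "")).strip()
        if raw.isdigit() and int(raw) > 0:
            rec["duration_minutes"] = int(raw)   # modify: normalise the type
            clean.append(rec)
        else:
            quarantined.append(rec)              # suspicious: hold for review
    return clean, quarantined

clean, quarantined = validate_duration(
    [{"duration_minutes": " 30 "}, {"duration_minutes": "soon"}]
)
# clean -> [{'duration_minutes': 30}]; quarantined -> [{'duration_minutes': 'soon'}]
```

The key design point is that nothing is silently dropped: every record ends up either validated or held back for a human decision.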

Genre data

At the heart of our data validation process is the report you receive within a minute of uploading your data to our Python API; it is easy to access, as it is simply a web page. The report contains our suggestions as to where your data have problems that cause them to fail validation, and the best way to fix those problems. You can watch our video on cleaning a genre column and see for yourself how many errors we found while cleaning a genre column from an external source that we needed to ingest into an EPG.

Blanks and mismatched data can completely mess up an EPG, possibly giving people wrong information about what appears on television and when. Genre information is also very important to the viewer experience. Someone who thinks they are about to watch a comedy but finds they are watching a war documentary instead is likely to be dissatisfied with an EPG that led them to believe they were about to watch something much more light-hearted. Genres also drive recommendations, and an EPG that recommends a news programme to its soap viewers, based on an incorrect genre, is not likely to be taken very seriously by its viewers in the future. And while a broadcast-time mistake caused by a failure to validate numbers is likely to be spotted very quickly, hopefully before the EPG goes live or very shortly afterwards, a genre column error is much harder for humans to spot unless they have a profound knowledge of television genres, yet it can have an equally negative effect on the users who discover the mistake while actually using the EPG.
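Genre cleaning of this kind can be sketched as a check against a known genre list, with fuzzy matching to repair near-misses. The genre list and the similarity threshold below are illustrative assumptions, not the filters our service actually applies:

```python
# Sketch of validating a genre column: exact matches pass, close
# misspellings are fixed, everything else (including blanks) is
# quarantined for review. Genre list and cutoff are illustrative.
import difflib

VALID_GENRES = ["Comedy", "Documentary", "News", "Drama", "Soap"]

def clean_genre(value):
    """Return (cleaned_value, status) for one genre cell."""
    candidate = value.strip().title()
    if candidate in VALID_GENRES:
        return candidate, "ok"
    close = difflib.get_close_matches(candidate, VALID_GENRES, n=1, cutoff=0.8)
    if close:
        return close[0], "fixed"      # near-miss, e.g. "Comdey" -> "Comedy"
    return value, "quarantined"       # blank or unknown: needs review

print(clean_genre("comedy"))    # ('Comedy', 'ok')
print(clean_genre("Comdey"))    # ('Comedy', 'fixed')
print(clean_genre(""))          # ('', 'quarantined')
```

This is exactly the kind of error that is invisible to a casual reader of the schedule but obvious to a filter with the full genre list.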

Data validation with the Spotless Data solution

You can read our introduction to using our API and then try out our service on your My Filters page, though you need to log in first. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our video on data cleaning an EPG file, which also explains how to use our API.

We use the HTTPS protocol, so your data are encrypted in transit and not accessible to any third party while they are in our care, a responsibility we take very seriously.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by clicking the white smile icon within a blue circle, found in the bottom right-hand corner of any page on our site.

Blog posts about Data Validation

The data validation of emails is a great example of where Spotless Data's machine learning filters help make businesses more successful.
Email addresses are the bread and butter of most modern communications, but address databases frequently include badly formatted addresses, old addresses that users no longer respond to and, increasingly, regulation means that you cannot use emails ...

Spotless Data's machine learning filters can really help make Tableau work perfectly every time.
Tableau has the most fantastic mapping functionality but how frequently have you uploaded your geographic data only to find a large number of “nulls” on the map? At Spotless we have set up a number of solutions specifically to transform your geographic data into Tableau cities, co...

Data validation has never been so easy thanks to Spotless Data's Machine Learning filters solution.
Spotless provides three levels of data validation of URLs to check URLs in your data files and ensure their data integrity:
- Review the structure - A high-level check that a URL is well structured and clean. This can be useful when collecting manually entered URLs from contact da...
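A high-level structural check like the first level above can be sketched as follows. This is a rough illustration of the idea, not Spotless Data's actual filter:

```python
# Rough sketch of a structural URL check: the scheme must be http(s)
# and a network location must be present. This checks structure only;
# it does not confirm that the URL actually resolves.
from urllib.parse import urlparse

def url_is_well_structured(url):
    try:
        parts = urlparse(url.strip())
    except ValueError:          # e.g. malformed IPv6 literal
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(url_is_well_structured("https://example.com/page"))  # True
print(url_is_well_structured("example.com"))               # False (no scheme)
```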

We have a lot of clients in the TV industry and one problem we’ve come across, again and again, is the inconsistent naming of TV shows. Whether it’s the same show being broadcast on multiple channels or comparing live TV schedules against VOD and OTT video services, everyone seems to have a different way of spelling TV show names.
In order to help fix this problem, we’ve de...

One of the most common problems with free-form data entry is that the data is not submitted in a standard form. This makes it hard to identify duplicated records and even harder to integrate data from a number of different sources to ensure data integrity.
For example, email addresses should always be in the form xxxx@yyy.zz and telephone numbers in the US should always have 10 digits. If yo...
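The two standard forms mentioned above can be checked with a short sketch like the one below. The email pattern is deliberately simple and illustrative, not a full RFC 5322 validator, and the helper names are our own:

```python
# Sketch of standard-form checks: a basic email shape (xxxx@yyy.zz)
# and normalisation of US phone numbers to exactly 10 digits.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def valid_email(address):
    return bool(EMAIL_RE.match(address))

def normalise_us_phone(number):
    """Strip punctuation; return 10 digits, or None if not a US number."""
    digits = re.sub(r"\D", "", number)
    if len(digits) == 11 and digits.startswith("1"):  # drop country code
        digits = digits[1:]
    return digits if len(digits) == 10 else None

print(valid_email("jane@example.com"))        # True
print(normalise_us_phone("(212) 555-0100"))   # '2125550100'
```

Normalising to a single canonical form like this is what makes duplicate records from different sources detectable in the first place.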