What about the smart data? IoT and data quality

Smart data is the fundamental prerequisite for smart cities and smart homes in the new Internet-of-things world.

The Internet of things (IoT) is going to bring massive changes to the way we live in the next few years, bringing the benefits of smart homes and smart healthcare to individuals, smart cities to societies and smart factories to manufacturers, among others, all of which will have knock-on effects. For instance, the insurance industry will need IoT data from smart vehicles, health apps and homes to be able to charge competitive rates for car, health and house insurance.

To bring these benefits, IoT generates tremendous quantities of big data, which are collected through sensors, sent to a central repository via a WiFi connection and then analysed, often in real time, so that the resulting information can be used in smart ways. Those who have to deal with these data are going to face tremendous challenges in ensuring that they are of sufficient quality for the smart products to do, seamlessly and swiftly, what they are meant to do and to record what they are intended to observe. There are few products which will not require IoT technology within the next decade. Given the sheer diversity of these products, a whole variety of different companies will produce the sensors for these smart things and an equally wide range of organisations will need to analyse the data generated, resulting in an inevitable lack of standardisation, quite apart from any actual errors which occur within these processes.

The need for smart data quality

The question we at Spotless Data thus want those involved in developing and implementing this technology and analysing the data produced by IoT to ask themselves is "And what about the smart data?"

Smart data requires data validation and data quality to be of any use. This involves checking whether the data are accurate, data cleaning to correct any errors and inconsistencies within both the data and the metadata, and data integration to integrate the data into whichever platforms they are entering, with the same data typically entering multiple platforms of multiple organisations.

So if we take the example of monitoring traffic in a city, the collected data will come from multiple sources, including smart vehicles, smart street lighting, smart car parks, smart roads, smart traffic lights and smart pollution monitoring sensors. Imagine that an IoT device communicating with a driver's smartphone, or directly with her vehicle, tells her that a parking space near to where she wants to go shopping is currently available, but when she gets there she finds no space free. Or imagine a smart traffic system which, far from ensuring that the roads remain congestion-free, is creating bottlenecks due to the inadequate quality of the data it is dealing with, leaving cars stuck in traffic jams for longer than was the case before the IoT technology was implemented. The cause of these problems will be that the data are in such a state that they can only be described as rogue data, drawing false pictures, in these cases of car parking availability and traffic flow. The data are in need of data validation through data cleaning to ensure that their quality is good enough, in real time, to complete these tasks.

Excellent data quality would mean that the shopper does find a parking space where the IoT device tells her there will be one and the city's traffic is experiencing fewer traffic jams than at any time since the 1950s.

Spotless has some great data validation features to ensure that your IoT data will function correctly, including uniqueness solutions for dealing with unusual types of rogue data, session solutions to clean both gaps and overlaps in any time series, and lookup solutions to find best matches and to fill in blanks. With Spotless Data built into the entry point of your data platform, the data can be cleaned as they arrive at your repository and then swiftly analysed by your analytics software to give the valuable information required in real time. The best analytics software in the world is only as good as the data it analyses, so it makes no sense to spend a fortune on analytics software while ignoring the quality of the data.
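To make the idea of gaps and overlaps in a time series concrete, here is a minimal sketch in plain Python. It is not the Spotless Data implementation or API; the Session type, the find_gaps_and_overlaps function and the sample car park data are all hypothetical names invented for this illustration. It assumes each sensor reports sessions with a start and an end time, and that sessions are expected to follow one another without a break.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Session(NamedTuple):
    """One reported interval, e.g. a period during which a parking bay was occupied."""
    sensor_id: str
    start: datetime
    end: datetime


def find_gaps_and_overlaps(sessions, tolerance=timedelta(0)):
    """Return (gaps, overlaps) as lists of (from, to) datetime pairs.

    Sessions are sorted by start time and compared against the latest end
    time seen so far, so a long session that spans several later ones is
    still detected as overlapping each of them.
    """
    ordered = sorted(sessions, key=lambda s: s.start)
    gaps, overlaps = [], []
    if not ordered:
        return gaps, overlaps

    latest_end = ordered[0].end
    for current in ordered[1:]:
        if current.start > latest_end + tolerance:
            # Nothing was reported between latest_end and current.start.
            gaps.append((latest_end, current.start))
        elif current.start < latest_end - tolerance:
            # Two sessions claim the same period of time.
            overlaps.append((current.start, min(latest_end, current.end)))
        latest_end = max(latest_end, current.end)
    return gaps, overlaps


# Hypothetical sample data for one parking bay sensor.
nine = datetime(2024, 1, 1, 9, 0)
sample = [
    Session("bay-17", nine, nine + timedelta(minutes=10)),
    Session("bay-17", nine + timedelta(minutes=15), nine + timedelta(minutes=30)),
    Session("bay-17", nine + timedelta(minutes=25), nine + timedelta(minutes=40)),
]
sample_gaps, sample_overlaps = find_gaps_and_overlaps(sample)
```

In this sample the sensor reports nothing between 09:10 and 09:15 (a gap) and reports two sessions covering 09:25 to 09:30 (an overlap). How such cases should then be repaired, for example by trimming or merging sessions, depends on the business rules of the platform receiving the data.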

Collecting metadata from different providers

Ingesting data with varying metadata tags is a classic recipe for turning a data lake or similar repository into a mudpit. The individual smart products will still work, but they might as well not. The data that a single organisation such as a city council uses will come from multiple different providers, who will use various metatags to describe the data. If the smart car park sensors are using one set of metatags, the smart traffic lights an entirely different set and the smart vehicles themselves a third set, this is a recipe for chaos. And these data will not solely be used by the local council, as the car parks may be privately owned, and the car park owner may use a different set of metatags in her data lake from those used by the people tasked with ensuring that the roads remain congestion-free. Meanwhile, everyone from the companies which insure the cars to the companies which (very shortly) will drive the cars to the companies which charge road tolls (as happens in many capital cities already) is likely to use different metatags. Standardisation is unlikely to happen anytime soon.

There are no right and wrong metatags for those collecting, storing and analysing the data, but what IS required is metatag consistency. Fortunately, there is now a simple solution to this problem, which is Spotless Data's machine learning filters, easily accessed through a Python API. We specialise in metatag issues as we have long recognised how vital consistency of tagging is. By passing your data through our API at the point of entry to your data storage, all the different metatags which do not conform to your own metatag system can be rapidly modified so that they are consistent with it. And as our API is based on machine learning, it will soon recognise the metatags which your organisation requires.
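As a rough illustration of what metatag consistency means in practice, the sketch below maps the tags used by different providers onto one canonical set. It is a simple rule-based stand-in using only the Python standard library, not the Spotless Data API or its machine learning filters. The canonical tags, the alias table and the function names are assumptions made up for this example.

```python
import difflib
import re

# Hypothetical canonical metatags chosen by the receiving organisation.
CANONICAL_TAGS = {"timestamp", "sensor_id", "latitude", "longitude", "occupied"}

# Hypothetical aliases already seen from different providers.
KNOWN_ALIASES = {
    "ts": "timestamp",
    "time": "timestamp",
    "deviceid": "sensor_id",
    "lat": "latitude",
    "lon": "longitude",
    "lng": "longitude",
    "is_occupied": "occupied",
}


def normalise_tag(tag, cutoff=0.8):
    """Return the canonical tag for a provider tag, or None if unrecognised."""
    # Lower-case the tag and collapse punctuation and spaces to underscores.
    key = re.sub(r"[^a-z0-9]+", "_", tag.strip().lower()).strip("_")
    if key in CANONICAL_TAGS:
        return key
    if key in KNOWN_ALIASES:
        return KNOWN_ALIASES[key]
    # Fall back to fuzzy string matching to catch simple misspellings.
    matches = difflib.get_close_matches(key, sorted(CANONICAL_TAGS), n=1, cutoff=cutoff)
    return matches[0] if matches else None


def normalise_record(record):
    """Split a record into (canonical fields, fields with unrecognised tags)."""
    clean, unknown = {}, {}
    for tag, value in record.items():
        canonical = normalise_tag(tag)
        if canonical is None:
            unknown[tag] = value
        else:
            clean[canonical] = value
    return clean, unknown


# A hypothetical record from a car park sensor provider.
clean, unknown = normalise_record(
    {"TimeStamp": "2024-01-01T09:00:00Z", "deviceId": "bay-17", "Lattitude": 51.5, "lng": -0.12, "fw": "1.2"}
)
```

Unrecognised tags are kept separately rather than silently dropped, so that someone can review them and extend the alias table. A learning-based approach would aim to reduce that manual step over time.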

Collecting data from IoT devices

Apart from the metatags there are the data themselves, which face the same problem of inconsistency because different sensors produced by various companies will report the same things in different ways. As IoT starts to take off, those who are storing and subsequently analysing the data will find that they have to deal with 40, 50 or more different sources of inconsistent data. As long as these data, or the metatags which underpin them, fail the basic quality test of consistency, the business intelligence extracted from them by the analytics software will give information that, far from being useful, is worse than useless.

However, inconsistency in the structure of the data is not the only issue. Other issues include data which are just wrong, such as readings from a faulty monitor that is supposed to be measuring air pollution but which, because it has broken, always gives the same result, inconsistent with the variations found in other air quality monitors in the area. A broken sensor in itself may be hard to detect but, because of the anomalies it leaves in the data, using Spotless is likely to mean the problem does not escape detection. Spotless is particularly good at spotting anomalies in data, such as blanks and overlaps, the resolution of which will allow IoT data to function much more smoothly and thus prevent any anomalies from producing short-term chaos within the data or their analysis.
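The broken air pollution monitor described above can be sketched as a simple check: flag any sensor whose readings barely vary while its neighbours' readings clearly do. This is only an illustrative heuristic written with the Python standard library, not how Spotless detects anomalies. The function name, the thresholds and the sample readings are all assumptions for the example.

```python
from statistics import median, pstdev


def find_stuck_sensors(readings_by_sensor, min_readings=5, ratio=0.05):
    """Return the sorted IDs of sensors whose readings look frozen.

    A sensor is flagged when the spread (standard deviation) of its readings
    is at most `ratio` times the typical spread of the other sensors in the
    same area. Sensors with too few readings are ignored.
    """
    spreads = {
        sensor_id: pstdev(values)
        for sensor_id, values in readings_by_sensor.items()
        if len(values) >= min_readings
    }
    stuck = []
    for sensor_id, spread in spreads.items():
        peers = [s for other, s in spreads.items() if other != sensor_id]
        if not peers:
            continue  # nothing to compare against
        typical = median(peers)
        if typical > 0 and spread <= ratio * typical:
            stuck.append(sensor_id)
    return sorted(stuck)


# Hypothetical hourly pollution readings from three monitors in one area.
area_readings = {
    "monitor-a": [30.0, 35.0, 41.0, 38.0, 33.0, 29.0],
    "monitor-b": [28.0, 36.0, 40.0, 39.0, 31.0, 30.0],
    "monitor-c": [42.0, 42.0, 42.0, 42.0, 42.0, 42.0],  # always the same value
}
stuck = find_stuck_sensors(area_readings)
```

Here only monitor-c is flagged, because its readings never change while the other two monitors vary from hour to hour. A flagged sensor is a candidate for inspection rather than proof of a fault, since a genuinely stable measurement would look the same to this check.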

Accumulated data from IoT processes not only have value in real time but can also be used later by, say, a city council at its leisure to work out whether new car parks or parking meters are needed on certain streets, or whether there need to be restrictions on car usage in certain areas or within certain hours in order to, say, keep air pollution at safe levels. All these uses require that the data are of sufficient quality that they can be used not merely for the short-term purpose for which they were originally collected but also for these new, longer-term purposes. Our own definition of data quality is that data should be fit for any purpose, not merely for what they were initially designed to do. The multi-faceted nature of IoT is a great example of why data need to be fit for any purpose. So a large quantity of data which has already served its short-term purpose can then also be analysed to give meaningful reports which can be used, say by city politicians, to make difficult political decisions based on facts derived from quality data and not merely on suppositions.

Data cleaning IoT data with the Spotless Data solution

You can read our introduction to using our API and then try out our service on your my filters page but, for this, you need to log in first. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our video on data cleaning an EPG file, which also explains how to use our API.

We use the HTTPS protocol to encrypt your data in transit, helping to keep them secure while they are in our care and inaccessible to any third party, a responsibility we take very seriously.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us you can speak to one of our team by pressing on the white square icon with a smile within a blue circle, which you can find in the bottom right-hand corner of any of the web pages on our site.

If your data quality is an issue, or you know that you have sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try now.