Imagine that you are in a future, augmented city. The sensors around you, through machine learning scoring and artificial narrow intelligence, realize that you are about to sneeze…even before you do. In response, a nearby 4D printer makes a handkerchief that feels as though it is made of the softest cotton-linen blend, and indeed those materials are part of the weave, but only a part. A variety of nano-materials make up the rest, incorporating soft sensors and various mechanical properties that allow the handkerchief to fly to you from the 4D printer. And indeed, this is 4D, as the material properties change from a flying bird shape with powerful wings to a soft facial tissue, landing in your hand just in time to capture your sneeze. Now whether the sneeze was caused by some errant dust – this is, after all, an augmented city with integrated agriculture and green spaces – or an allergen, the handkerchief's sensors now analyze the sputum and mucus that you sneezed into it, just as secondary assurance that you aren't about to spread cold, flu, or more serious viral or bacterial contamination around you. The handkerchief is fully reusable, recyclable and repurposable: it can be sterilized to become a face mask for you, to protect you from the dust or allergens, or to protect others from your disease vector, or to become something else altogether.

Machine learning scoring at the sensor package level – that is being done today by companies such as Simularity.

Machine learning and deep learning being incorporated into software to help guide augmented human decisions and autonomous machine decisions – a variety of companies, such as the ones we wrote about in our Data Grok posts over the past few years.

Artificial narrow Intelligence – is appearing in everything from chatbots to surgical robots, and is being investigated by more companies than we can add to this post.

Soft sensors – are currently being researched mostly in the textile and fashion industries.

IoT Architecture that includes hardware, firmware and software from the sensor to the Fog and Edge, through multiple intermediate aggregation points into a distributed Core of on-premises and multi-Cloud infrastructures and services – not implemented anywhere that I know of, and our own development of this architecture is still nascent.

Completion of the 5Cs IoT Maturity Model that we helped to develop in 2014, and are still working on today – again, not that I know of.

Fully augmented smart cities – there are projects and megaprojects and conferences everywhere, but all silo'd and incomplete to date.

A sensor analytics ecosystem that would allow this to occur, with proper provisioning of privacy, transparency, security and convenience while building trust through two-way accountability – not yet, and perhaps never, but something that we are working toward.

Once, most data quality issues were from human errors and inadequate business processes. While these still exist, new data sources, such as sensor data and third-party data from social media, openData and "wisdom of the crowd" introduce new sources of potential error. And yet, the old ways of storing "data" in log books, engineering journals, paper notes and filing cabinets are still widely practiced. At the same time, data quality is more important than ever as organizations rely more on predictive algorithms, machine learning, deep learning, artificial intelligence and cognitive computing. The basics of data quality have remained the same, but the means by which we can assure data quality are changing.

Data Quality Basics

Fundamentally, data quality is about trust; that the decisions made from the data are good decisions, based upon trustworthy data. To achieve this trust, data must be:

correct

valid

accurate

timely

complete

consistent

singular (no duplications that affect count, aggregates, etc)

unique

[have] referential integrity

[apply] domain integrity (data rules)

[enforce] business rules
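Several of the properties listed above can be checked mechanically. The following is a minimal sketch in Python, not a definitive implementation; the records, field names and the set of allowed country codes are hypothetical, invented only to illustrate completeness, uniqueness, domain integrity and referential integrity checks.

```python
from collections import Counter

# Hypothetical records and rules, invented purely for illustration.
CUSTOMERS = [
    {"id": 1, "country": "US", "email": "a@example.com"},
    {"id": 2, "country": "DE", "email": None},
    {"id": 2, "country": "DE", "email": None},
    {"id": 3, "country": "XX", "email": "c@example.com"},
]
ORDERS = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 9},
]
VALID_COUNTRIES = {"US", "DE", "FR"}


def incomplete(rows, required):
    """Completeness: rows missing a value for any required field."""
    return [r for r in rows if any(r.get(f) is None for f in required)]


def duplicate_keys(rows, key):
    """Uniqueness: key values that appear more than once."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def domain_violations(rows, field, allowed):
    """Domain integrity: rows whose field value is outside the allowed set."""
    return [r for r in rows if r[field] not in allowed]


def orphans(children, fk, parents, pk):
    """Referential integrity: child rows that point at a missing parent."""
    parent_keys = {p[pk] for p in parents}
    return [c for c in children if c[fk] not in parent_keys]


report = {
    "incomplete": len(incomplete(CUSTOMERS, ["email"])),
    "duplicate_ids": duplicate_keys(CUSTOMERS, "id"),
    "bad_country": [r["id"] for r in domain_violations(CUSTOMERS, "country", VALID_COUNTRIES)],
    "orphan_orders": [o["order_id"] for o in orphans(ORDERS, "customer_id", CUSTOMERS, "id")],
}
print(report)
```

Checks such as timeliness, accuracy and business rules depend on context that only the organization can supply, so they are left out of this sketch.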

Now, these principles must be applied to all the new sources and uses of data, often as part of streaming or real-time decision support, automated decisions, or autonomous systems.

Moreover, the data rules and the business rules must reflect reality, including evolving cultural norms and regulatory requirements. For example, in many areas of the world, gender is no longer based simply on biology at birth, but includes gender identification that may be more than just male or female, and may change over time as an individual's self-awareness changes. As another example, regulations in some areas of the world are imposing stricter restrictions around individual privacy, such as the General Data Protection Regulation (GDPR) in the EU with full application coming in May of 2018.

Data Verification

Third-party data verification tools have been around for decades and are often purchased and installed on-premises, including their own databases of information. Today, data verification may be done through such tools, or through openData and openGov databases; modern data preparation tools may even recommend freely available data sources, such as demographic data, to enhance and verify the data that your organization has collected or generated. Other data, such as social media data, is also available to enhance your understanding of customers, markets, culture, regulations and politics that might influence your decisions. Current third-party data is most often accessed through Application Programming Interfaces (APIs) that may be HTTP or ReSTful, or might be proprietary. Use, or rather, misuse, of these APIs has the potential to degrade, rather than enhance, your decision support process. Another issue is that you may not know how third-party data is governed according to the basics of data quality. Again, modern data preparation and API management tools can help with these issues, as can open architectures and specifications.

Data from sensors and from sensor-actuator feedback loops isn't new. Data from connected sensors, actuators, feedback loops, and all kinds of things, from pills to diagnostic machines, from wearables to cars, from parking sensors to a city's complete transportation system, some of which may be available through openGov initiatives, is new. Many of the organizations using such IoT data have never used such data before.

Now that we have taken a very brief look into data quality and new opportunities, let's go into the new tools we have to use these new data opportunities.

Data Stewardship through AI

In the spirit of drinking one’s own champagne, many of the new uses of data – the output of data science – are being applied to data management. As software has consumed the world, machine learning is eating software; deep learning and artificial intelligence are rapidly becoming the top of this food chain. Once, a dozen or so source systems made for a good size data warehouse, with nightly ETL updates. Now, organizations are streaming hundreds of sources into data lakes. The people, processes and technologies for data quality can only keep up through augmentation through the use of advanced analytic algorithms. Machine Learning uses metadata to continuously update business catalogues as artificial intelligence augments the data stewards. Metadata is changing as well, to provide semantic layers within data management tools, and to better understand the data sets coming from the IoT, social media, or open data initiatives.

The first players to apply these techniques to data management and analytics became our first "Data Grok" companies, data that helps humans grok data and how that data can be used. Since then, the first companies to earn the DataGrok designation, Paxata and Ayasdi, have been joined by many others adding machine learning, deep learning and even artificial narrow intelligence (ANI) to provide recommendations and guardrails to data scientists, data stewards, business analysts, and any individual using organizational data to make decisions.

Data Quality Relations

Data Management development through the execution of enterprise architecture, policies, practices and procedures encompasses the interaction among data quality, data governance, and data integrity. Regulatory and process compliance are dependent upon all three. Ownership of each data set, data element and even datum, is critical to assuring data quality and data integrity, and is the first step to providing data governance. Business metadata, technical metadata and object metadata come together through business, technical and operational ownership of the data to build data stewardship and data custodian policies. The architectural frameworks used for Enterprise, IoT and Data architectures result in specifications for each critical data element that provide an overarching view across all business, technical and operational functions.

Data governance interacts with architectural activities in an agile and continuous improvement process that allows standards and specifications to reflect changing organizational needs. The processes and people can assure that data specifications are applicable to the needs of each organizational unit while assuring that data standards are uniformly applied across the organization. The size and culture of an organization determine the formality and structure of data governance, which may include a governing council, sponsorship at various organizational levels, executive sponsorship (at a minimum), data ownership, data stewardship, data custodianship, change control and monitoring. But even with all this, the goal of data governance must be to provide appropriate access to data, and not restrict the use of data…from any source.

IT Must Adapt

Information Technology has often been seen as a bottleneck. Many times in our consulting work, we have found ourselves in the position of arbiter between IT and the business. Self-service BI, Analytics and Data Preparation mean IT must become an enabler of data usage, providing trustworthy data without restricting the users. The productionalizing of data science again means that IT must be an enabler of data usage, including the machine learning and other advanced analytics models that data science teams produce. As data science and data management & analytics tools come together, the need for IT to guide the use of data and tools without limiting that use becomes paramount. At the same time, privacy and security must be retained within data governance. Patient data must only be available to the patient and those healthcare professionals and caregivers who require access to that data. Personally Identifiable Information (PII) must be controlled. Regulatory compliance, such as GDPR and PCI, must be adhered to.

There is also a need for two-way traceability from the datum to the end-use in reports and analytics, training sets or scoring, and from the end-use to the source system, including lineage of all transformations along the way. This lineage of source and use enables both regulatory compliance and collaboration. Such transparent history also helps build trust in the data, and in what other users and IT data management professionals have done to the data.
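One way to picture two-way traceability is as a directed graph that can be walked in either direction. The sketch below is only an illustration under stated assumptions: the `Lineage` class and the data set names (`scada.temp`, `report.quality` and so on) are hypothetical and do not describe any particular product.

```python
from collections import defaultdict


class Lineage:
    """Hypothetical lineage store: each edge records one transformation step."""

    def __init__(self):
        self.downstream = defaultdict(set)  # source -> {(target, transform)}
        self.upstream = defaultdict(set)    # target -> {(source, transform)}

    def add_step(self, source, target, transform):
        self.downstream[source].add((target, transform))
        self.upstream[target].add((source, transform))

    @staticmethod
    def _walk(start, edges):
        # Depth-first traversal that collects every reachable node.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt, _transform in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def uses_of(self, node):
        """From a source toward every end-use that depends on it."""
        return self._walk(node, self.downstream)

    def sources_of(self, node):
        """From an end-use back to every source that feeds it."""
        return self._walk(node, self.upstream)


lineage = Lineage()
lineage.add_step("scada.temp", "lake.temp_clean", "unit conversion")
lineage.add_step("lake.temp_clean", "report.quality", "daily average")
lineage.add_step("erp.orders", "report.quality", "join on batch id")
lineage.add_step("lake.temp_clean", "ml.training_set", "feature extraction")

print(sorted(lineage.uses_of("scada.temp")))
print(sorted(lineage.sources_of("report.quality")))
```

Keeping both edge directions means that a compliance question ("where did this number come from?") and an impact question ("what breaks if this source changes?") are answered by the same traversal.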

IT and OT must Work Together

As connected products mature through the 5Cs of our IoT maturity model (connection, communication, collaboration, contextualization and cognition), information technology and operations technology, business systems and engineering systems, must share data under a unified architecture. Much of the promise of the IoT can only be achieved through IT and OT working together. Consumer and marketing information being merged with supply chain and production quality information to build predictive models that allow just-in-time inventory control and agile, custom product delivery is only one example of changes to consumer expectation, whether that consumer is another business, a government or an individual. Industries from every market, such as the energy sector, consumer packaged goods and pharmaceutical manufacturing have reaped the benefits of IT and OT working together, of SCADA/Historians data being integrated with Cloud marketing and sales data or ERP data. But for this partnership between IT and OT to work, they each must trust the data of the other, and that only happens through data governance and data quality efforts.

Metadata and Master Data Management in DQ

Metadata and Master Data Management (MDM) are fundamental in ensuring data quality, and key to using trustworthy data throughout a modern data ecosystem from the most modern data sources and analytic requirements at the Edge to the most enduring legacy systems at the Core; from the droplets in the Fog to the globally distributed multi-Cloud and hybrid architectures. Metadata and MDM have been part of the solution all along, but now must be applied in new ways, both at the core and at the Edge, and distributed through multiple Cloud, hybrid architectures, on-premises, and out into the furthest reaches of the Fog, as all these resources elastically scale up and down at need.

Sensor Data Makes for Interesting DQ

Some of us have been dealing with sensors, sensor-actuator feedback loops and the concepts of the large, complex system for all of our careers, but for many, the fundamentals of connected hardware will be new. Sensor data can be messy. Two sensors from the same manufacturer will be slightly different in the data sets produced, even though they both meet specification; two sensors from different manufacturers will certainly be different in center point, range, precision and accuracy, and how the data are packaged. Sensors drift over time, and will need calibration against public standards. Sensors age, and may be replaced, and both of these conditions affect all the previous points.
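As a rough illustration of why two in-specification sensors still need individual treatment, the sketch below applies a simple two-point linear calibration against reference standards. It assumes a linear sensor response, which real sensors may not have, and every reading and reference value here is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Calibration:
    """Linear correction: corrected = gain * raw + offset."""
    gain: float
    offset: float

    def apply(self, raw: float) -> float:
        return self.gain * raw + self.offset


def two_point(raw_lo: float, raw_hi: float, ref_lo: float, ref_hi: float) -> Calibration:
    """Fit gain and offset from raw readings taken at two reference standards."""
    gain = (ref_hi - ref_lo) / (raw_hi - raw_lo)
    return Calibration(gain=gain, offset=ref_lo - gain * raw_lo)


# Two hypothetical sensors measured against the same references (0 and 100 units).
cal_a = two_point(raw_lo=1.0, raw_hi=101.0, ref_lo=0.0, ref_hi=100.0)
cal_b = two_point(raw_lo=-2.0, raw_hi=48.0, ref_lo=0.0, ref_hi=100.0)

# Different raw readings of the same physical quantity converge after correction.
print(cal_a.apply(51.0), cal_b.apply(23.0))
```

Because sensors drift, the fitted gain and offset would need to be refreshed at each recalibration rather than treated as constants.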

Data architecture and DQ

Having worked in System Engineering for aerospace, I go to Deming's definition of Quality as conformance to specifications well suited to the customer, and, for data, specifications come from the architecture.

Architecture abstracts out the organizational needs as a series of views representing the perspectives of the people, processes and technologies affected by and effected through that solution, system or ecosystem. A standalone quality solutions architecture is not a good idea, as quality must be pervasive through an architecture. However, adding quality as a view within an architecture assures that data quality, data governance and compliance are properly represented within the architecture. {Though outside the scope of this post, I would also consider adding security as a separate view.} There are many architectural frameworks, and even controversy about their effectiveness; TOGAF, MIKE2, 4+1 and BOST are the main frameworks. Architectural frameworks focus on enterprise, data and solutions (application) architectures, with a recent interest in Internet of Things (IoT) architecture. Adherence to a framework or method is not as important as that the process by which an architecture is created meets the culture and needs of the organization.

Standards

For reference purposes, here is a list of data quality standards and methods that you might find useful:

We began using Informatica in its very early days. By 1998, we were using it for an ambitious enterprise data warehouse project spanning three divisions of a Fortune 100 company, taking in transactional and operational data from over 40 operating companies. The days are long gone when we would have implemented complex data architectures and data flows using Informatica Power Center and Power Mart in hub-and-spoke arrangements. But the need to provide powerful data management for analytics around business processes has only grown, as sales, services and customer touch-points have grown. We now generate data every minute of the day, awake or asleep. We tweet, email, and post to social media, personal blogs, and photography and video sharing sites. The things that make the things we use, and all the things around us, have embedded computers and are sensor enabled, and generate even more data. Because of this, we have changed the focus of data management away from simply extracting from common source systems, transforming so that all the data conformed to internal standards, and loading into that mystical single source of truth [the ETL of old]. Today, our focus is on discovering and exploring data relevant to our organizational and individual needs, no matter the source. And yet, all this data must be vetted; data quality and data governance are more important than ever. While the idea of a single source of truth is passé, trust in our data is not. Whether we are trying to improve our personal fitness or determine the impact of the latest marketing campaign, or bring the perpetrators of genocide to justice, we expect consistency in the answers to the questions we ask of all these sources of data.

Informatica has been amazingly innovative in expanding its capabilities for data management. Informatica solutions and products keep up with where industry is going. Informatica was one of the first data management companies to realize the importance of the Internet of Things (IoT). Their development of the Intelligent Data Platform is seen as a hallmark in handling all these new sources of data. Their attention to metadata and master data management has also improved, and even outpaced, the industry. Informatica can still be deployed on-premises, in one’s own data center, or in private or hybrid clouds, or in public Cloud platforms. Real-time data management and continuous event processing are also part of Informatica’s suite of products. All of this innovation has been rewarded again today, as for the 11th year in a row, Informatica has been named #1 in Customer Loyalty for data integration. Informatica has earned top marks in customer loyalty in the annual Data Integration Customer Satisfaction Survey conducted by the independent research firm Kantar TNS.

To show that Informatica is not resting on its laurels, they have also announced today new and enhanced products and services:

In the upcoming webinar for SnapLogic, we will be looking at the Internet of Things from the perspective of data.

What data can be expected

How IoT data builds upon the evolution of data management and analytics for big data

Why IoT data differs from data from other sources

Who can make the most use of IoT data or Who can be impacted most by IoT data

Where IoT data needs to be processed

When IoT data has an impact

Specifically, we will look at how the recent evolution of data management in response to big data is ideally suited in some ways for IoT data, and is still evolving for some unique characteristics of IoT data and metadata.

The business drivers range from new sources of data that can help organizations better understand, service and retain customers, to consolidation in many industries bringing about the need to bring together data from disparate and duplicate information and operation systems after merger and acquisition. One of the more pervasive developments has been the movement of data acquisition, storage, processing, management and analytics, to the Cloud.

Beyond these corporate motives, governments and non-government organizations (NGOs) are using data for good to bring about better quality of life for millions or billions of individuals. Clean water, prosecuting genocide, fighting human trafficking, reducing hunger, and opening up new means of commerce are only a few examples. Some look at the future and see a utopian paradise, others a dystopian wasteland. The IoT with evolving data management and analytics are unlikely to bring about either extreme, but I do think that the future will be better for billions as a result.

The basic question that we’ll ask in this webinar is “What is the Internet of Things?”. From simple connectivity, to the resulting cognitive patterns that will be exhibited by these connected things, we will explore what it means to be a thing on the Internet of Things, how the IoT is currently evolving, and how to bring value from the IoT. It is also important to recognize that the IoT is already here; many organizations are reaping the benefits from IoT data management and sensor analytics. The webinar will show ways in which your organization can join the IoT or mature your IoT capabilities.

Big data was often described by three parameters overwhelming the old ways of integrating and storing data: volume, velocity and variety. Really, we are looking at deftly interweaving the volumetric flow of data in timely ways that flexibly provide for privacy, security, convenience, transparency, governance and compliance. Nowhere is this evolution better expressed than in data management for the Internet of Things (IoT).

We will cover some of the more interesting and useful aspects of preparing for IoT data and sensor analytics. Though coined by Kevin Ashton in 1999, the IoT is still considered in the early stages of adoption and relevance. While the latest trends in data management and analytics apply to IoT data and sensor analytics, there are specific needs for properly addressing IoT data, which legacy ETL (extract, transform and load) and DBMS (database management systems) simply don’t handle well, such as time-series data and location data, as well as metadata specific to IoT. In addition to these characteristics of IoT data, we will explore other aspects that make IoT data so interesting.

The IoT isn’t meeting its hype as yet; doing so requires many solution spaces coming together as ecosystems. Instead, the IoT is growing within each vertical separately, creating new data silos. This is exemplified by the 30-plus standards bodies addressing IoT data communication, transport and packaging. Metadata and API management can help. Metadata also addresses the nuances of IoT data, such as the factors arising from replacing a sensor that allow continuity of the data set and understanding of the difference before and after the change.
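To make the sensor-replacement point concrete, here is a minimal sketch of time-versioned sensor metadata. Each reading is normalized with whichever metadata record was in effect when it was taken, and the record's serial number is kept with the value so the change remains visible. The history table, serial numbers, timestamps and correction factors are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical metadata history for one measurement point:
# (effective_from timestamp, sensor serial, gain, offset), sorted by timestamp.
HISTORY = [
    (0, "SN-001", 1.0, -1.0),
    (1000, "SN-002", 2.0, 4.0),  # sensor replaced at t=1000
]
STARTS = [entry[0] for entry in HISTORY]


def metadata_at(ts):
    """Return the metadata record in effect at timestamp ts."""
    index = bisect_right(STARTS, ts) - 1
    if index < 0:
        raise ValueError(f"no sensor metadata recorded for t={ts}")
    return HISTORY[index]


def normalize(ts, raw):
    """Correct a raw reading and tag it with the sensor that produced it."""
    _start, serial, gain, offset = metadata_at(ts)
    return {"ts": ts, "value": gain * raw + offset, "sensor": serial}


readings = [(900, 51.0), (1000, 23.0)]
series = [normalize(ts, raw) for ts, raw in readings]
print(series)
```

The corrected values line up across the replacement, while the `sensor` tag preserves the before-and-after distinction for anyone auditing the series later.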

Information Technology (IT) and Operational Technology (OT) are coming together in IoT. This means interfacing legacy systems on both sides of the house, such as enterprise resource planning (ERP) and customer relationship management (CRM) systems with supervisory control and data acquisition (SCADA) systems, and relational database management systems (RDBMS) with Historians DBMS. This also means deriving context from the Edge of the IoT for use in central IT and OT systems, and bringing context from those central systems for use in streaming analytics at the Edge. Further, this means that machine learning (ML) is not just for deep analysis at the end of the DMA process; ML is now necessary for properly managing data at each step from the sensor or actuator generating the data stream, to intermediate gateways, to central, massively scalable analytic platforms, on-premises and in the Cloud.

As we discuss all of this, our participants in today’s webinar will come away with five specific recommendations on gaining advantage through the latest IoT data management technologies and business processes. For more on what we will be discussing, visit my post on the SnapLogic Blog. I hope that you’ll register and join the conversation on 2016 October 27 at 10:00 am PDT.

Contest

During the week of 2016 September 26 at the O'Reilly Strata-Hadoop conference in New York City, Kognitio announced the start of their contest looking for the best use case or application of Kognitio-on-Hadoop. Kognitio are looking for innovative solutions that include Kognitio-on-Hadoop. Innovation is defined by Kognitio as:

Innovation could be a novel or interesting application or it could be something that is common place but is now being done at scale.

This covers a wide range of potential big data analytics use cases that might include data-for-good, government, academic or business applications. Contestants must write up their use case in a short paper, to be submitted to Kognitio no later than 2017 March 31. Applications will be judged by a named panel headed by a leading industry analyst. The winner will be notified on 2017 June 01. Applicants can be individuals, groups or organizations. The winner may choose among the following three prizes:

US$5,000.00

A one year standard support contract

A one year internship at Kognitio’s R&D facility in the UK – subject to the intern being eligible to work in the United Kingdom

Kognitio on Hadoop is free to download; registered entrants will receive notifications of patches and updates to the free software, as well as preferential support on the Kognitio forums.

Kognitio

As one of the first in-memory, massively parallel processor (MPP) analytics platforms, Kognitio has over 25 years of experience to bring to big data processing…always in-memory, MPP and on clusters. Today, the Kognitio Analytical Platform is delivered via appliances, software, and cloud. Kognitio on Hadoop was announced at the 2016 Strata-Hadoop conference in London. This free-to-use version of the Kognitio Analytical Platform includes full YARN integration allowing Hadoop users to pull vast amounts of data into memory for data management and analytics (DMA). As an in-memory MPP analytical platform, Kognitio is very scalable and can provide MPP execution of any computational statistics or data science applications. MPP of SQL, MDX, R, Python and other languages, for advanced analytics, is handled through a bulk synchronous parallel (BSP) API. This provides extremely fast, high concurrency access to the data. In addition to these languages, Kognitio has a strong partnership with business intelligence vendors, such as Tableau, Microstrategy and others. For Tableau, Kognitio has a first-class connector; and, for example, a joint customer in the financial services market, with 10,000 customers accessing nine petabytes (9PB) of data in Hadoop [five terabytes (5TB) in Kognitio]. As an example of the high concurrency available through Kognitio, the financial services customer routinely sees 1500-2000 queries per second from ~500 concurrent sessions. Now, know that this is an analytical subsystem; there are another 15 such uses of Kognitio, for specific purposes, accessing that 9PB data lake.

Kognitio on Hadoop

Kognitio on Hadoop can be downloaded free of charge and with no data size limits or functional restrictions. This download is available without registration. There is a range of paid support options available as well. Kognitio on Hadoop is integrated with YARN, and works on any existing Hadoop infrastructure. Thus, no additional hardware is required solely for Kognitio. Kognitio on Hadoop accesses files, such as CSV files, stored on Hadoop, in HDFS, as one would normally store data in Hadoop. Intelligent parallelism in Kognitio 8.2 allows queries to be assigned to as few as one core, or to use all cores, allowing for extraordinarily high levels of concurrency. This apportionment is performed dynamically by Kognitio. In addition to the obvious advantages of such a mature product, as free-to-use, Kognitio on Hadoop can be much more easily deployed, tested, and brought into production, while many open source solutions are still trying to run in a lab. Kognitio on Hadoop was developed internally using Apache Hadoop. Kognitio on Hadoop is in production at customers on Apache Hadoop, and the distributions from Cloudera, Hortonworks and MapR.

As the Internet of Things matures beyond simple connectivity and communication, in-memory MPP analytical platforms, such as Kognitio on Hadoop, will be required to allow context to be derived from intelligent sensor packages and Edge gateways, to the Cloud, and provide context to the Edge, Fog and sensors, in real-time. Kognitio on Hadoop conceivably allows true collaboration and contextualization among things and humans in sensor analytics ecosystems.

The TeleInterActive Press is a collection of blogs by Clarise Z. Doval Santos and Joseph A. di Paolantonio, covering the Internet of Things, Data Management and Analytics, and other topics for business and pleasure.

Mindmaps

Our current thinking on sensor analytics ecosystems (SAE) bringing together critical solution spaces best addressed by Internet of Things (IoT) and advances in Data Management and Analytics (DMA) is updated frequently. The following links to a static, scalable vector graphic of the mindmap.