Succeeding with AI in a Data-Driven World

Whether from computer scientists or science fiction authors, we’ve been hearing of the inevitable rise of artificial intelligence (AI) for years. Now, at last, it seems to be real. Rapid developments in machine learning (ML), the ready availability of scale-out cloud computing, mobile applications that generate masses of data, almost limitless cloud storage, and open source ML toolkits have accelerated adoption of AI-fueled products and services. We read daily about the successes of Google Brain, IBM Watson, Azure ML and other teams developing ML-based applications whose performance now often surpasses that of humans. Research continues to deliver stunning progress, even optical hardware that performs inference at the speed of light.

For enterprises, the road to ML-derived insights has been rocky. Seemingly, all one needs is plenty of data, a big-data infrastructure, a few skilled engineers, and an effort to integrate open source code bases such as Google TensorFlow, Apache Beam, Flink, and Spark. Sadly, most organizations have only the first: they are drowning in data. Users are realizing that ML is not yet a consumable technology like, say, a database. They lack the data science skills needed to manage streams, clean data, create and train models, deploy them, and manage their use. They lack the software engineering expertise to combine a polyglot soup of open source code bases into a functioning, maintainable solution. And they aren’t yet comfortable committing to a cloud platform: even though the major cloud vendors have made components easier to consume, it’s an uphill battle for non-cloud-native organizations to build data pipelines and manage their lifecycles. The most challenging task of all is building accurate models of the edge environment being monitored, and few organizations have the skilled data scientists that success demands.

The way out of this quandary lies in rethinking the streaming data software architecture with a goal-first mindset. Notice that “big-data storage,” “data flow pipelines,” and “streaming” are not goals; they are infrastructure technologies. Nor is “AI/ML” a goal: it is just a means to the real end, which is to deliver real-time insights from streaming data as it is produced. Do you need to store the data? Do you need a centralized architecture and lots of computing power? In many cases the answer is “no,” but seeing this requires a different view of the edge.

To do this, we need to turn the “infrastructure first” paradigm on its head, focusing not on the infrastructure but on the insights we want. Rather than designing data pipelines and choosing packages for data transformation, storage, cleaning, training, and inference, we need to automate two fundamental tasks: building a parallel, stateful “pipeline” for the data automatically, and eliminating the need for a data scientist to hand-build a model of a complex real-world environment, train it, and use it to deliver insights. Instead, the model of the edge environment should be built in real time, from the streaming data itself.

Of course, we need to know what entities are represented in the data. Then, using a stateful architecture, we can let the streaming data build the model of the real world on the fly, as the data is received. A stateful digital twin model of the edge can be efficiently managed in memory and updated at the rate at which data arrives (whereas cloud-hosted big-data stores are slow). In this architecture, also known as the distributed actor model, digital twins represent entities in the data and statefully process data in real time from their real-world siblings: devices, applications, and other infrastructure. As data arrives, each digital twin processes its own data, cleaning and analyzing it. With this architecture, all a user needs to do is specify how each twin should analyze (and even display) its data. Digital twins can perform simple analytics, or collectively drive inputs into much more complex analytical functions, including self-training and inference using deep neural networks (DNNs).
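A minimal sketch of this idea in Java (hypothetical class and method names, not SWIM’s actual API): a digital twin holds its own state in memory and updates a derived insight, here a rolling average, as each observation streams in from its real-world sibling, with no external store involved.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical digital twin for a single sensor: keeps a rolling window of
// recent readings in memory and maintains a running average as each
// observation streams in. State lives with the twin, not in a database.
class SensorTwin {
    private final String sensorId;
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private double sum = 0.0;

    SensorTwin(String sensorId, int windowSize) {
        this.sensorId = sensorId;
        this.windowSize = windowSize;
    }

    // Called once per streamed observation; the twin's state updates in place.
    void onReading(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();   // evict the oldest reading
        }
    }

    // The twin's current insight, available at any instant.
    double average() {
        return window.isEmpty() ? 0.0 : sum / window.size();
    }

    String id() { return sensorId; }
}
```

In a real deployment the `onReading` callback would be driven by the streaming layer, and the twin could publish its state to other twins; the point is that cleaning and analysis happen per-entity, at ingest rate.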

Digital twins can also collaborate to perform sophisticated analyses that require inputs from multiple entities. These range from simple “joins” of the evolving state of different twins to sophisticated correlations and even machine learning. Specifying what is required should be easy for a traditional developer skilled in a high-level language like Java, who can simply invoke analysis or learning functions from any major toolset.
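The “join” of evolving state can be sketched the same way (again with hypothetical names, not SWIM’s API): an aggregate twin subscribes to updates from several member twins and maintains a derived value over their latest states.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical aggregate twin: tracks the latest state published by each
// member twin and maintains a simple join over them (here, the maximum of
// their most recent values), updated whenever any member reports.
class AggregateTwin {
    private final Map<String, Double> latest = new HashMap<>();

    // Invoked whenever a member twin publishes new state.
    void onMemberUpdate(String memberId, double value) {
        latest.put(memberId, value);
    }

    // A simple analysis over current member states; a richer deployment
    // could instead feed these values into a trained model for inference.
    double maxAcrossMembers() {
        return latest.values().stream()
                     .mapToDouble(Double::doubleValue)
                     .max()
                     .orElse(Double.NaN);
    }
}
```

The same pattern scales up: replace `maxAcrossMembers` with a correlation or a call into an ML toolkit, and the aggregate twin becomes the point where multi-entity learning happens.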

And what of the original data? Well, if you need it, save it. But in the vast majority of use cases for streaming data, there is no need to save data that is only ephemerally useful. Instead, digital twins can learn on the fly, at the edge.

Succeeding with AI in today’s enterprise requires freeing users from the infrastructure challenges of collecting, storing, analyzing, and modeling data. It requires an abstraction at the level of digital twins and how they relate and share context, so that users can easily describe their environments and the relationships between entities in order to derive insights. SWIM provides that abstraction and takes care of all the complexity underneath: building stateful digital twins that learn from the streamed data of their real-world siblings, and expressing the analytics, training, and inference required based on the observations of the many twins that make up a smart infrastructure.

About the Author

Simon Crosby is CTO of SWIM.AI. Simon brings an established record of technology industry success, most recently as co-founder and CTO of Bromium, a security technology company. At Bromium, he built a highly secure virtualized system to protect applications. Prior to Bromium, Simon was the co-founder and CTO of XenSource before its acquisition by Citrix, and later served as the CTO of the Virtualization and Management Division at Citrix. Simon has been a tenured faculty member at the University of Cambridge. Simon Crosby was named one of InfoWorld’s Top 25 CTOs in 2007.
