
At the recent Strata Data conference in NYC, Paige Roberts of Syncsort had a moment to sit and speak with Paco Nathan of Derwen, Inc. In part one of the interview, Roberts and Nathan discuss the origins, current state, and future trends of artificial intelligence and neural networks.

In the second part, Roberts and Nathan go into the current state of Agile and deep learning.

Roberts: Changing the subject a little, one of the other things you talked about that struck me pretty strongly is that, basically, the father of Agile says don’t do Agile anymore. [Laughter]

Nathan: [Laughter] Right!

Roberts: Can you talk about that a little bit?

Nathan: Yeah, I was referencing a recent paper this year, actually just a few months ago, by Ron Jeffries, who created Extreme Programming. Pair Programming came out of that. Scrum came out of that. A lot of the things we recognize as Agile came from that. He was one of the signatories of the Agile Manifesto 20 years ago. Recently he came out saying that the definitions of Agile that he’s seen floating around in industry don’t have anything to do with the intention they were trying to strike at. He wrote down, “20 years later, here’s my advice for what you really need to do with your team. Let’s get away from the names, and let’s just really focus on how to make teams better.”

Roberts: Wow. Okay. What’s the paper that he did?

That’s pretty interesting. I think for tons of software companies right now, that’s the Bible. You have to do Agile to survive.

If you saw the talk by David Talby that was a really good one too. It was called, “Ways That Your Machine Learning Model Can Crash and What You Can Do About It.” He’s done a lot of work, especially in healthcare, with machine learning and he just had case study after case study of what goes wrong. The point there was, the real work is not developing the machine learning model. The real work is once you put it into production, what you have to do to make sure that it’s right, and that’s ongoing.

Yeah. That’s always true.

I heard David’s talk in London five minutes before my talk, and I made a slide to represent some of the things he talked about because it fit in with what I was saying. I showed it and then there were arguments out in the hallway afterwards, because the Agile people were like, “How dare you say that!” It’s really salient because if I’m developing a mobile web app, and I have a team that I’m engineering director of, I’m going to bring in my best architects and team leads early in the process. They’re going to go define the high level definitions and define the interfaces. As the project progresses more into fleshing out different parts of the API and getting into more maintenance mode, I don’t have to have my more senior people involved.

Right.

With machine learning, it is the exact opposite. If I’ve got a dataset, and I want to train a model, that’s a homework exercise for somebody who’s just beginning in data science. I can do that off the shelf. But once you get deployed and start seeing edge cases and the issues that have to do with ethics and security, that’s not a homework exercise. Unless you’re in context, and actually running in production, you’re not going to know in advance what those issues are.
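The “homework exercise” Nathan describes really is only a few lines with today’s off-the-shelf tooling. A minimal sketch, assuming scikit-learn and its bundled Iris dataset:

```python
# Training an off-the-shelf model on a ready-made dataset: the easy part.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

The hard part, as the conversation notes, only shows up after deployment: edge cases, ethics, and security in production.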

Yeah, but a lot of the conversation now is about the fact that most of your datasets are in some way biased, and there’s a lot of ethics involved in launching a machine learning model. I just saw an article online where they’re making ethics in machine learning a first year course for people that they’re training for ML and AI (Carnegie Mellon, University of Edinburgh, Stanford). I guess it actually speaks a little bit to what you said about putting your experts at the end during production. To a certain extent, it seems to me like you also want to have the experts at the beginning, looking at the data before it even starts the process.

Definitely. Deloitte, McKinsey, Accenture, all of them, when we do executive briefings, they all want it set at the beginning. Before we even talk about introducing machine learning into your company, you need to get your ducks in a row as far as breaking down the data silos, and getting your workflow for cleaning your data in place, and a culture that’s based around using data engineering and data science appropriately. You need to do all of those things before you can even start on machine learning. There’s a lot of foundation that needs to be done correctly.

I said something about the high percentage of machine learning projects that never make it into production on Twitter, and got a response from John Warlander, a Data Engineer at Blocket in Sweden. He said, “I sometimes wonder how many of those ‘not in production’ big data projects happen in companies that don’t even have their ‘small data’ in order. That’s often where most of the low-hanging fruit is.” I’ll put that in my blog post about the Strata event themes and industry trends. We’re talking about a lot of those important themes, so I’ll probably put a lot of quotes from you in it.

David Talby had a great quote: “Really, if you want to talk about AI in a product, what you’re talking about is what you’re going to do once you’re deployed and the product’s being used by customers. How do you keep improving? Because if you’re not doing that, you’re not doing AI.”

Well if you’re not doing that, you’re certainly not having that feedback loop. You’ve lost that. When looking at the improvement in accuracy over random chance of any model, there’s always that curve that says this is more and more accurate and then it becomes less and less accurate over time if you don’t constantly retrain your models. One of the themes for Syncsort, as a data engineering kind of company, is making sure that the data that you’re feeding in there is itself constantly refreshed and improved. You said something in your talk that stuck with me. The value in ML and AI right now isn’t as much in iterating through models, or getting the best model, it’s feeding your models the best datasets.
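That retrain-or-decay loop can be sketched roughly as follows; the accuracy floor, helper name, and toy data are hypothetical, not from the conversation:

```python
# A sketch of the retraining loop: check a deployed model's accuracy against
# fresh labelled data, and refit it once accuracy drops below a floor.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

ACCURACY_FLOOR = 0.90  # assumed acceptable threshold


def maybe_retrain(model, X_recent, y_recent, X_refreshed, y_refreshed):
    """Refit on refreshed data when accuracy on recent data degrades."""
    if model.score(X_recent, y_recent) < ACCURACY_FLOOR:
        model = clone(model).fit(X_refreshed, y_refreshed)
    return model


# Toy demonstration: a model trained on old data drifts, then is refreshed.
X_old, y_old = np.array([[0.0], [1.0], [2.0], [3.0]]), np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_old, y_old)

# The world shifts: the decision boundary moves, so recent accuracy drops.
X_new, y_new = np.array([[0.0], [1.0], [2.0], [3.0]]), np.array([0, 1, 1, 1])
model = maybe_retrain(model, X_new, y_new, X_new, y_new)
print(f"accuracy after refresh: {model.score(X_new, y_new):.2f}")
```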

I mean if you want a good data point on that, a lot of these companies, even ones who are leaders in AI, will share their code with you. They’re not going to share their data. That was kind of the punchline of the situation with CrowdFlower, now Figure Eight. Google bought into self-driving cars, and they realized they could replace a lot of one-off machine learning processes with deep learning, but to do that, they needed really good labelled datasets. Other manufacturers saw their success and wanted to do self-driving cars, too. They hired the talent, and the first thing they found out is that if they want to do deep learning, they don’t have enough data, or enough good, labelled data. So, they go to Figure Eight and ask, “Hey, can you label our datasets?”

Lukas Biewald, the founder of Figure Eight, was talking in San Francisco a couple years ago, saying, “Yeah, for about $2–3 million per sensor, we’d be happy to work with you on that.” And he had customers lined up, GM and all the others, because …

Because it’s worth it.

Yeah and if they don’t have it, they’re out of the self-driving car business. It may be a high price but it will likely include years of data.

People focus so much on the models. I have to have the most sophisticated algorithm, …

No. That’s not it.

The only reason that AI didn’t take off back in the ’80s or the ’90s, when you and I were first studying it, was that we didn’t have enough data. We couldn’t crunch it. We couldn’t ingest that amount of data and do anything with it, affordably.

There needed to be millions of cat pictures on the internet before we could really do deep learning.

Before we could create something that could identify a cat picture. That’s just the nature of the game.

That was the paper that launched it all. And then the open source work for using GPUs to accelerate it.

That’s really taking off more now in spaces other than video games. Walking the Strata floor, there are a lot more vendors out there taking advantage of GPUs.

There’s nothing really sacred about the architecture of a GPU with respect to machine learning. It just happens to be faster than a general-purpose CPU at doing linear algebra. But now we’re seeing more ASICs that can do more advanced linear algebra, at enough scale that you don’t have to go across the network. That’s the game. We’ll probably see a lot more custom hardware. Basically we’re in this weird sort of temporal chaos regime where hardware is moving faster than software and software is moving faster than process.

Hardware ALWAYS moves faster than software. Most software is just now finally, in the last few years, catching up to things like using vectors to take advantage of regular CPU chip cache.

And now we’re putting TensorFlow computations on GPUs.

Exactly. And we’re creating compute hardware that’s specific to task. Software always lags behind the hardware and then business processes have to develop after that.

Yeah, you have to log some time doing the job before you can really figure out the process. I think your company is in a really good space right now. You’ve gotta get the data right. And it’s not just a one-off. You’ve got to keep getting the data right across your company. Now, and forevermore.

Yeah, tracking and reproducing data changes in production is a big challenge for our customers. If you made 25 changes to the data to make it useful for model training, you then have to make those exact same 25 changes in production so that the model sees data in the format it’s expecting. I’m doing a series of short webinars on tackling the challenges of engineering production machine learning data pipelines, including one on tracking data lineage and reproducing data changes in production environments. So is there anything else going on at the moment that you’d like to let us know about?

I have a little company called Derwen.ai. If you check there, we’ve got a lot of articles. It’s my consulting firm and we do a lot of work with the conferences. We get to see a real bird’s eye view, and we hear from all kinds of people. We’re like Switzerland. We get to hear what a lot of people are working on, even if they’re not ready to go public with it. I hear the pain points people are dealing with, and help out the start-ups. It’s kind of like a distributed product management role.
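On the earlier point about reproducing the same 25 data changes in production: one common approach is to capture every transformation and the model in a single serializable pipeline, fitted once and reloaded verbatim at serving time. A minimal sketch, assuming scikit-learn; the features, labels, and file name are hypothetical:

```python
# Capture every data transformation plus the model in ONE scikit-learn
# Pipeline, persist that single fitted object, and reload it in production
# so the model always sees data in the format it expects.
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),      # one of the "changes", applied identically everywhere
    ("model", LogisticRegression()),  # the model only ever sees transformed data
])

X_train = np.array([[1.0, 200.0], [2.0, 180.0], [8.0, 40.0], [9.0, 35.0]])
y_train = np.array([0, 0, 1, 1])
pipeline.fit(X_train, y_train)

# Training persists one artifact; production reloads the same object, so the
# scaling step is replayed exactly rather than re-implemented by hand.
path = os.path.join(tempfile.gettempdir(), "example_pipeline.joblib")
joblib.dump(pipeline, path)
served = joblib.load(path)
print(served.predict(np.array([[8.5, 38.0]])))  # a raw production row goes straight in
```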

At PowerObjects, we have found that one of the most essential roles on any Microsoft Dynamics 365 team is devoted specifically to the data needs of the project – the data experts. Any enterprise is only as good as the data that it has available to support itself.

Therefore, critical to the success of any D365 project are each of the following:

The work performed by a data expert varies from day to day, depending on the specific phase of the project.

The data expert enters a normal solution build during PowerObjects’ planning phase, beginning with the effort to transition from the Sales team to the Delivery team. This transition is designed to explain and refine the tasks to be performed, and to introduce, define the roles of, and empower the members of the combined client team.

Clients are encouraged to identify key indicators and functionality during the planning phase. These indicators will quickly identify the entities, fields, and relationships important to the customer and define the solution developed. Specific subject matter experts (SMEs) on the client side will quickly be identified or make themselves known during early discussions.

This phase focuses on setting the expectations of the client; translating the functional requirements of the client into the technical requirements and design of the solution; and identifying the new/modified/retained business processes involved. Together, this planning will define and refine the scope, cost, and timeline of the D365 project at hand. The planning phase aids the client team in understanding their own requirements, as the structure of D365 encourages process improvement. However, D365 should not be considered or presented as “the process.”

Each data migration/integration effort requires great care, thoroughness, and detailed planning. The work of the data expert normally begins in earnest during this phase – with the identification of unique identifiers of all source data, identification and definition of all source data elements to be migrated, and the mapping of the source data to be migrated and/or integrated into specific entities and fields within the D365 solution.

Simply put, mapping the data means determining where the data currently stored in the legacy system(s) will land in the corresponding D365 data structures, while confirming the format of both the source and destination data. Most of the time, we use a series of spreadsheets to build out and refine this mapping work. The more specific these documents, the easier it will be to execute the migration/integration work.

Note that not all legacy data will be migrated. Historic data stored in a customer’s legacy system is often found to contain duplicate, inconsistent, incomplete, or outdated information. The data expert may identify risks associated with certain “dirty” data while working with a client’s data set. They will offer client team members assistance in data quality and data cleansing methods and best practices; but data cleansing work is normally the domain of the client. History has proven that it is better to resolve data quality/cleanliness issues PRIOR to migrating data – ensuring only the most useful and cleanest data will be moved into the D365 instance.

Furthermore, data in the source systems may not be aligned exactly with the destination D365 entity receiving the data. This requires that the entire “column” of data be manipulated to match what D365 is expecting to receive. These individual steps are sometimes described as “transformation formulas.” For example, the existing name field may need to be cut up or “parsed” into the D365 firstname and lastname fields, or all telephone numbers must be presented in a certain format (e.g., “###.###.####”).
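The two transformation formulas above might look roughly like this in code; the firstname/lastname fields and the ###.###.#### format come from the text, while the helper functions themselves are illustrative:

```python
# Two illustrative "transformation formulas": splitting a legacy full-name
# column into D365's firstname/lastname fields, and normalizing phone
# numbers to the ###.###.#### format.
import re


def split_name(full_name: str) -> dict:
    """Parse 'First Last' into the firstname/lastname fields D365 expects."""
    first, _, last = full_name.strip().partition(" ")
    return {"firstname": first, "lastname": last}


def format_phone(raw: str) -> str:
    """Normalize any 10-digit phone string to ###.###.####."""
    digits = re.sub(r"\D", "", raw)[-10:]  # keep the last 10 digits
    return f"{digits[0:3]}.{digits[3:6]}.{digits[6:10]}"


print(split_name("Jane Doe"))          # → {'firstname': 'Jane', 'lastname': 'Doe'}
print(format_phone("(555) 123-4567"))  # → 555.123.4567
```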

Data Migration is the process of moving data from one system to another. Considerations include:

Format of the data transferred.

Planned, one-time (or infrequent) transfer of data.

Data Integration is the process of building and maintaining the synchronizing (transfer) of data between systems. Considerations include:

Format of the data transferred.

Recurring transfer of data.

Frequency of the transfer.

What will trigger the transfer.

Additional factors that will affect the level of effort and duration of a data migration/integration task are:

Number and size of the source data legacy system(s).

Volume of data (the number of tables, rows, columns to be processed).

Types of data.

Use of option sets or “picklists.”

Number and complexity of the transformational formulas required to put the data into a format acceptable to the destination entity field.

Specific care must be taken with the steps required to match, dedupe, and integrate the distinct source data available, as well as with the coordinated migration of data into the development (or Sandbox) instance.

The integration mapping will often follow that of the migration mapping, which can be used as the starting point of integration mapping efforts. The mapped-to destinations will often coincide, but the source of data and the processes to get to that destination will vary. Therefore, integration and migration processes should be conceived and developed as separate functionality.

The most satisfying part of the data expert’s work is seeing the populated data entities joined with the work of the application developers, representing the visualization of the client’s expectations on screen. Once data is merged with the forms, reports, and dashboards of the application, and then viewed by the customer through the eyes of the D365 toolset, the connection between the functional and the technical requirements becomes visible. This is also the opportunity to fine-tune the components of the solution. Fine-tuning allows our team to deliver the solution originally envisioned, prepare the solution and the client for user acceptance testing, and ultimately deploy the data migration and integration components as integral parts of the overall D365 solution to the client’s production environment.

We hope this gives you a view into what being a data expert is like! We’re always looking for great talent at PowerObjects; check out open roles and apply on our website.

Since Syncsort recently joined the Hyperledger community, we have a clear interest in raising awareness of the Blockchain technology. There’s a lot of hype out there, but not a lot of clear, understandable facts about this revolutionary data management technology. Toward that end, Syncsort’s Integrate Product Marketing Manager, Paige Roberts, had a long conversation with Wikibon Lead Analyst Jim Kobielus.

In the first part of the conversation, we discussed the basic definition of what the Blockchain is, and cut through some of the hype surrounding it. In the second part, we dove into the real value of the technology and some of the practical use cases that are its sweet spots. In this final part, we’ll talk about the future of Blockchain, how it intersects with artificial intelligence and machine learning, how it deals with privacy restrictions from regulations like GDPR, and how to get data back out once you’ve put it in.

Roberts: Where does Blockchain go from here? What do you see as the future of Blockchain?

Kobielus: It will continue to mature. In terms of startups, they’ll come and go, and they’ll start to differentiate. Some will survive to be acquired by the big guys, who will continue to evolve their own portfolios, while integrating those into a wide range of vertical and horizontal applications.

Nobody’s going to make any money off of Blockchain itself. It’s open source. The money will be made off of cloud services, especially cloud services that incorporate Blockchain as one of the core data platforms.

Believe it or not, you can do GDPR on Blockchain, but here’s the thing: the GDPR community is working out exactly what you can do to delete data records consistently on the Blockchain. Essentially, you can encrypt the data and then delete the key.

Right. If you can’t decrypt it, you can’t ever read it.

Yeah. Inaccessible forevermore, in theory. That’s a possibility for harmonizing Blockchain architecture with GDPR and other mandates that require the right to be forgotten. The regulators also have to figure out what is kosher there. I think there will be some reconciliation needed between the techies pushing Blockchain and the regulators trying to enforce the various privacy mandates.
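The encrypt-then-delete-the-key idea (often called crypto-shredding) can be illustrated with a toy sketch; the XOR/SHA-256 keystream below is a teaching device only, not production cryptography:

```python
# Toy illustration of crypto-shredding: the ledger entry is stored encrypted,
# the key lives OFF-chain where it can be deleted, and shredding the key
# makes the immutable on-chain ciphertext permanently unreadable.
import hashlib
import secrets


def keystream(key: bytes, length: int) -> bytes:
    """Derive a deterministic byte stream from the key (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor_cipher(key: bytes, data: bytes) -> bytes:
    """XOR the data with the keystream; applying it twice decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


key_store = {}                                  # off-chain, deletable
record = b"PII: jane.doe@example.com"           # hypothetical personal data
key_store["user-42"] = secrets.token_bytes(32)

on_chain = xor_cipher(key_store["user-42"], record)          # kept forever
assert xor_cipher(key_store["user-42"], on_chain) == record  # readable while keyed

# Right to be forgotten: shred the key, not the (undeletable) chain entry.
del key_store["user-42"]
# With the key gone, on_chain can no longer be decrypted back to the record.
```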

Just as important in terms of where it’s going, Blockchain platforms as a service (PaaS) will become ever more important components of the data providers’ overall solutions. Year by year, you’ll see the Microsofts, IBMs, and Oracles of the world evolve Blockchain-based cloud services into fairly formidable environments.

There are performance issues with Blockchain now, in terms of speed of updates, but I also know that there is widespread R&D to overcome those. VMWare just announced they’re working on a faster consensus protocol, so that different nodes on the Blockchain can come to consensus rapidly, allowing more rapid updates to the chain. Lots of parties are looking for better ways to do that. So it might become more usable for transactional applications in the future.

Blockchain deployment templates are going to become the way most enterprise customers power this technology. AWS and Microsoft already offer these templates for rapid creation and deployment of a Blockchain for financial or supply chain or whatever. We’re going to see more of those templates as the core way in which people buy, in a very business friendly abstraction. There will be a lot of Blockchain-based applications for specific needs. We’ll see a lot of innovation in terms of how to present this technology and how to deliver it so that you don’t have to understand what a consensus protocol is or really give a crap about what’s going on in the Blockchain itself. It should be abstracted from the average customer.

More in terms of going forward, you’ll see what I call “Blockchain domain accelerators.” There are Blockchain consultants everywhere now. There are national Blockchain startup accelerators. There are industry-specific Blockchain startup accelerators. There are Blockchain accelerators in terms of innovation of cryptocurrency and Internet of Things. We’re going to see more of these domain accelerator industry initiatives come to fruition using Blockchain as their foundation. They’ll analyze and make standards of how to deploy, secure and manage this technology specific to industry and use case requirements. That definitely is the future.

As I mentioned before, it will become a bigger piece of the AI future, because of Blockchain-based distributed marketplaces for training data. Training data for building and verifying machine learning models for things like sentiment analysis has real value. There aren’t many startups in the world that already have massive training datasets. To build the best AI, you’ll need to go find the best training datasets for what you’re working on.

I talked about that a little with Paco Nathan at Strata, how labelled, valid, useful training datasets were incredibly valuable now, and AI companies recognize that. They will share their code with you, but not their data, not for free.

I really think you’ll see a lot more AI training dataset marketplaces with Blockchain as the backing technology. It’s going to become a big piece of the AI picture.

Blockchain security is another big thing going forward. The weak link in Blockchain is protecting your private keys, which provide you with secure access to the cryptocurrencies running on the chain. What we’re going to see is more emphasis on security capabilities that are edge-to-edge, securing Blockchains at the weakest link, which is the end user managing their keys. I think you’ll start to see a lot of Blockchain security vendors that help you manage your private keys, and also smart contracts. Smart contracts on the Blockchain have security vulnerabilities in their own right. We’ll see a lot of new approaches to making these tamper-proof. There’s already a lot of fraud.

I think I’ve covered most of the big things I see coming. That is the really major stuff.

One more thing, I’m curious about since Blockchain is still fairly new to me. There’s a lot of conversation about how you store data on the Blockchain, and a lot of research into things like securing it, and speeding up update speed, but storing data is only half the story with data management. Once you’ve put all this data in, you have to then get it out. If I’ve got a Blockchain, it has all this information I need, how do I go find and retrieve information from it? Do I use SQL?

There’s a query language in the core Blockchain code base.

So, it has its own specific query language, and people will have to learn a whole other way to retrieve data?

Basically, the core of Hyperledger has got a query language built in. It’s called Hyperledger Explorer. Hyperledger, in itself, is an ecosystem of projects just like Hadoop is and was, that will evolve. It’ll be adopted at various rates, some projects will be adopted widely, and some very little during production Blockchain deployments.

There are some parallels with early Hadoop. Hadoop had an initial query language under its broad scope that didn’t take off, and the community updated and improved on it with HiveQL. Same thing with Spark: it started out with one query engine, Shark, and switched to another, Spark SQL.

We have to look at the entire ecosystem. Over time, some pieces may be replaced by proprietary vendor offerings, or different open source code that does these things better. It’s part of the maturation process. Five years from now, I’d like to see what the core Blockchain Hyperledger stack is. It may be significantly different. It may change as stuff gets proved out in practice.

Yeah, Hadoop changed a lot over the last decade.

Hadoop itself has become just part of a larger stack, with things like TensorFlow, R, and Kafka for streaming. Innovation continues to deepen the stack. The NoSQL movement, graph databases, the whole data management menagerie continues to grow. We’ll see how the core protocol of Blockchain evolves too. It’s a work in progress, like everything else.

I’ve written a bunch of articles on this. It’s changing all the time.

I’ll be sure to include some links in the blog post, so folks can learn more. I really thank you for taking the time to speak with me. It was really informative.

No problem. I enjoyed it.

Jim is Wikibon’s Lead Analyst for Data Science, Deep Learning, and Application Development. Previously, Jim was IBM’s data science evangelist. He managed IBM’s thought leadership, social and influencer marketing programs targeted at developers of big data analytics, machine learning, and cognitive computing applications. Prior to his 5-year stint at IBM, Jim was an analyst at Forrester Research, Current Analysis, and the Burton Group. He is also a prolific blogger, a popular speaker, and a familiar face from his many appearances as an expert on theCUBE and at industry events.


In the first part of the conversation, we discussed the basic definition of what the Blockchain is, and cut through some of the hype surrounding it. In this second part, we dove into the real value of Blockchain technology and some of the practical use cases that are its real sweet spots.

Roberts: The hype cycle tends to make all kinds of wild claims. It will do everything but wash your socks. Which claims for Blockchain do you feel have some validity?

Kobielus: First of all, since it doesn’t support full CRUD operations (records can’t be updated or deleted), it’s not made for general-purpose database transactions. It’s made for highly specialized environments where you need a persistent, immutable record, like logging: logging of security-related events, logging of system events for later analysis and correlation, and so on. Or where you have an immutable record of assets (video, music, and so forth) in a marketplace where they are intellectual property that needs protection against tampering. If you have a tamper-proof distributed record, which is what Blockchain is, it’s perfect for maintaining vast repositories of intellectual property for downstream monetization. Or for tracking supply chains.

A distributed transaction record that can’t be repudiated, that can’t be tampered with, that stands up in legal situations is absolutely valuable. So, Blockchain makes a lot of sense in those kinds of applications. In addition to lacking the ability to delete and edit the data, Blockchain is slow. It’s not an online transactional database. Updates to the chain can take minutes or hours depending on how the chain is set up, and how extensive the changes are, so you can’t have a high concurrency of transactions. It’s just not set up for fast query performance. It’s very slow.
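The tamper-evidence described here comes from each block committing to the hash of its predecessor, so editing any past entry breaks every link downstream. A minimal sketch of just that linkage (consensus, signatures, and Merkle trees omitted):

```python
# A hash-chained ledger in miniature: each block stores the SHA-256 hash of
# the previous block, so rewriting history is immediately detectable.
import hashlib
import json


def make_block(data: str, prev_hash: str) -> dict:
    block = {"data": data, "prev_hash": prev_hash}
    payload = json.dumps(block, sort_keys=True).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block


def chain_is_valid(chain: list) -> bool:
    for i, block in enumerate(chain):
        body = {"data": block["data"], "prev_hash": block["prev_hash"]}
        payload = json.dumps(body, sort_keys=True).encode()
        if block["hash"] != hashlib.sha256(payload).hexdigest():
            return False  # the block body was edited after the fact
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False  # the link to the predecessor is broken
    return True


chain = [make_block("genesis", "0")]
chain.append(make_block("shipment received", chain[-1]["hash"]))
chain.append(make_block("shipment inspected", chain[-1]["hash"]))

assert chain_is_valid(chain)
chain[1]["data"] = "shipment lost"  # attempt to rewrite history
assert not chain_is_valid(chain)    # tampering is immediately detectable
```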

Also, the world is moving towards harmonization around privacy protection, consistent with what the European Union has done with the General Data Protection Regulation (GDPR), and with the recent California privacy regulation that is similar to GDPR. GDPR requires that any personally identifiable information (PII) be capable of being forgotten, meaning people have the right to request deletion of their personal data, or to edit it if it’s wrong. In Blockchain, you can’t delete or edit a record once it’s written. There’s a vast range of enterprise applications that have personally identifiable information. The bulk of your business (sales, marketing, customer service, HR, etc.) has tons of PII data.

So, Blockchain is not suitable for those core transaction processing applications. Any application that demands high performance queries will not be on the Blockchain. It’s not suitable for highly scalable real-time transactions of any sort, whether or not they involve PII data.

The way I see it, Paige, is there’s a range of fit-for-purpose data platforms in the data management space. There are relational databases, all the NoSQL databases, HDFS, graph databases, key-value stores, real-time in-memory databases, and so on. Each of those is suited to particular architectures and use cases, but not to others. Blockchain is fundamentally a database, and it’s got its uses. It’s not going to dominate all data computing like a monoculture, no matter what John McAfee says. That’s not going to happen. It’s limited both technologically and by regulation. It’s a niche data platform that’s finding its sweet spot in various places.

You mentioned a couple of good use cases like supply chain management. I’ve heard of uses like tracking diamonds from the mine to the jewelry store to be certain of their origins, that they’re not blood diamonds. All of the examples I had heard of in the past were based on the concept of Blockchain as a transactional ledger or even a sensor log. For example, you keep sensors on your food from the farm to the market to make sure that it never went above a certain temperature for a certain amount of time, that sort of thing. One of the use cases you mentioned was actually news to me, that you could store other sorts of data like application code, so you could do code change management with it. What other use cases do you see coming?

Actually, there are a few pieces I published recently on vertical, application-focused uses of Blockchain. Blockchain startups are trying to grab a piece of the video streaming market. Essentially these services, a lot of which are still in alpha or beta pre-release phase, use Blockchain in several capacities. One, for distributed video storage. Two, for distributed video distribution via a peer-to-peer protocol.

Distributed video monetization using a Blockchain-based cryptocurrency that’s specific to each environment to help the video publishers monetize their offering. Blockchain for distributed video transactions, and for contracts. Blockchain for distributed video governance.

So are you talking about having something like Netflix bucks?

More and more Blockchain applications aren’t one hundred percent on the Blockchain. They handle things like PII off the chain, for instance, and put that in a relational database. Most architecture is using fit-for-purpose data platforms for specific functions in a broader application. That is really where Blockchain is coming into its own.

Another specialized Blockchain use case is artificial intelligence, one of my core areas. I’ve been reading for a while now about the AI community experimenting with using Blockchain as an AI compute brokering backbone; there’s a company called Cortex. You can read my article on that. They use Blockchain as a decentralized AI training data exchange. They have data that has the core ground truths a lot of AI applications need to be trained on.

So you’re saying they basically create really solid, excellent training datasets, doing all the data engineering to make sure these are good training datasets for AI ground truths, and then use Blockchain to exchange them to other AI developers?

It’s a Blockchain for people who built and sourced their training data to store it in a ledger so that others can tap into that data from an authoritative repository.

Right. Okay. That makes sense. Seems like a valuable commodity to the AI community.

Several small companies are doing this. They’re converging training data into an exchange or marketplace for downstream distribution to data scientists, or whoever will pay for the training data. Blockchain is used as an AI middleware bus, an AI audit log, an AI data lake.

What I’m getting at, Paige, is that there are lots of industry-specific implementations of Blockchain. Industries everywhere are using this, some in production, but many of them are still piloting and experimenting with Blockchain in a variety of contexts including e-commerce, AI, video distribution, in ways that are really fascinating.

These are the same kinds of dynamics that we saw in the early days of Hadoop and NoSQL and other technologies. Each technology market grows by vendors finding a sweet spot, an application that their approach is best suited to.

We see a lot of hybrid data management approaches in companies that use two or more strategies in a common architecture.

One thing that's missing from all that stuff is real-time streaming, continuous computing applications. Blockchain is very much static data; it's almost the epitome of static data. You won't see too many real-time applications for Blockchain alone, but that's okay. Blockchain is good for the things that it's good for.

Blockchain will find its niche given time?

Yes.

Be sure not to miss Part 3 where we’ll talk about the future of Blockchain, how it intersects with artificial intelligence and machine learning, how Blockchain deals with privacy restrictions from regulations like GDPR, and how to get data back out of the Blockchain once you’ve put it in.

Jim is Wikibon’s Lead Analyst for Data Science, Deep Learning, and Application Development. Previously, Jim was IBM’s data science evangelist. He managed IBM’s thought leadership, social and influencer marketing programs targeted at developers of big data analytics, machine learning, and cognitive computing applications. Prior to his 5-year stint at IBM, Jim was an analyst at Forrester Research, Current Analysis, and the Burton Group. He is also a prolific blogger, a popular speaker, and a familiar face from his many appearances as an expert on theCUBE and at industry events.

Since Syncsort recently joined the Hyperledger community, we have a clear interest in raising awareness of the Blockchain technology. There’s a lot of hype out there, but not a lot of clear, understandable facts about this revolutionary data management technology. Toward that end, Syncsort’s Integrate Product Marketing Manager, Paige Roberts, had a long conversation with Wikibon Lead Analyst Jim Kobielus.

In this first part of that conversation, we discussed the basic definition of what the Blockchain is, and cut through some of the hype surrounding it.

Roberts: Tell us a little about yourself.

Kobielus: I’m James Kobielus. I’m the lead analyst at Wikibon. I’m a veteran analyst covering data analytics, artificial intelligence and cloud data computing, and one of my research focus areas is Blockchain. In fact, I plan to write and publish a Wikibon research document on its maturation in the enterprise some time in the next few months.

Ah, good timing for the interview then. Let’s start with the basic definition. What exactly is the Blockchain?

Blockchain was defined initially by the legendary inventor of Bitcoin, Satoshi Nakamoto, which is not really his name, just a pseudonym. Blockchain is not a currency; rather, it is a distributed, trusted hyperledger. It's essentially a database, but the architecture is distributed and can be stored on dozens, hundreds, or thousands of separate computers that remain in synchronization with each other. The ledger of data is stored in a secure fashion where everybody can read the Blockchain, and nobody can repudiate an update they made to it because there's a trust mechanism built in. And the Blockchain cannot be changed. It's immutable. Once you write something to a Blockchain, it cannot be deleted and it cannot be edited, so it's a very specialized type of distributed database. In other words, traditional databases enable you to do what we often call CRUD operations: create, read, update, and delete the data. Blockchain only allows you to create data, and to update the ledger by appending to it. You can also read it, but you can't delete it. So, it's specialized to a variety of applications that don't require full CRUD semantics.
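That append-only, tamper-evident design can be sketched as a toy hash-chained ledger. This is a simplified illustration only; the class and field names are invented here, and no production Blockchain is this simple, but it shows why earlier entries can't be quietly edited.

```python
import hashlib
import json

def block_hash(block):
    """Deterministic SHA-256 hash of a block's contents."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

class ToyLedger:
    """Append-only chain: you can create and read, but not update or delete."""
    def __init__(self):
        self.chain = []

    def append(self, data):
        prev = self.chain[-1]["hash"] if self.chain else "0" * 64
        block = {"index": len(self.chain), "prev_hash": prev, "data": data}
        block["hash"] = block_hash(
            {k: block[k] for k in ("index", "prev_hash", "data")}
        )
        self.chain.append(block)
        return block

    def verify(self):
        """Any edit to an earlier block breaks its hash and every later link."""
        for i, block in enumerate(self.chain):
            expected = block_hash(
                {k: block[k] for k in ("index", "prev_hash", "data")}
            )
            if block["hash"] != expected:
                return False
            if i > 0 and block["prev_hash"] != self.chain[i - 1]["hash"]:
                return False
        return True

ledger = ToyLedger()
ledger.append({"tx": "alice pays bob 5"})
ledger.append({"tx": "bob pays carol 2"})
print(ledger.verify())                                  # True: chain intact
ledger.chain[0]["data"] = {"tx": "alice pays bob 500"}  # try to edit history
print(ledger.verify())                                  # False: tampering detected
```

The non-repudiation Kobielus mentions falls out of the same structure: rewriting one entry invalidates every hash that comes after it.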

Okay, that makes good sense. CRUD operations are pretty familiar to anybody in the database space. How is the Blockchain different from other databases?

So, first of all, it’s not different in any radical sense from a number of approaches that have been around for a while now. There are plenty of distributed databases in the world from various vendors that use a variety of approaches to split the data into separate tables, or volumes, with varying degrees of synchronization across different servers. There are approaches in traditional relational databases such as sharding that enable the datasets to be distributed across many nodes.

What makes Blockchain different is that it is primarily for logging data for a secure, trusted record of transactions. Nobody can deny that they posted something because there is a complete audit trail in the updates that were made to the chain. There’s a distributed trust mechanism built into it that you don’t necessarily see in other data platforms or distributed data environments as an embedded capability.

Blockchain also is not limited in the types of data it can store, like say, a relational database is limited to storing structured data in structured tables. It can store pretty much any type of data within the blocks themselves. The term “block” actually has a real meaning in the Blockchain architecture. The data blocks can store textual data, video objects, application code or whatever you have. So, it’s quite versatile. It is a database that can store unstructured, multi-structured data, in addition to structured data.

Blockchain is open source. There are a lot of open source databases, of course. It was originally incorporated into Bitcoin and it’s still the foundation for Bitcoin, and for most cryptocurrencies, but Blockchain has evolved independently of the currencies. Using Blockchain doesn’t necessarily imply that it’s supporting a cryptocurrency application. It could be potentially supporting many kinds of applications.

There's a core open source distribution, and there are various forks of that distribution, such as the one managed by the Hyperledger Foundation, an industry group that manages core Blockchain open source code. There is also the Ethereum project, which manages other forks.

Syncsort also sees Blockchain as important to our customers going forward. We recently joined Hyperledger so that we can help contribute to it, like we did in the early days of Hadoop and Spark. Blockchain has a lot of hype around it, though. One of the biggest things we’re trying to do is see what is hype and what is reality. Why do you think Blockchain has been riding so high on the hype train?

Hype serves an important purpose, which is to raise people's awareness and understanding of particular things, usually in a marketing context: if you want to sell products, you have to make people aware of them and what they can do. Every technology has a hype cycle, so what you have to do if you're a buyer is get down to exactly what the product does and what differentiates it from other approaches. How mature is this technology? Is it a stable code base? Are there standards? How widely is it adopted? How tested is it? Is there an ecosystem around it?

Blockchain has actually been around for about 10 years. Over that time, it's grown in a lot of ways, one of which is its tie to cryptocurrencies and the media attention around them. That has raised awareness of the Blockchain with a lot of business people, technical people, and even consumers.

It’ll take a couple of years for a general understanding across Blockchain and the technologies related to it to really get to a point where people are as familiar with Blockchain as they are now with something like mobile computing. So, the awareness will take a while. Also, it will take a while for the startup community to catch up. There are a LOT of startups, but none of them have really taken off yet. I could list some names, but they’re all unfamiliar to most people, even technical people.

Ten years ago when Hadoop got started, there were a bunch of startups, and a few rose above the rest, and built substantial businesses based on Hadoop: Cloudera, Hortonworks, MapR and a few others. There is no equivalent, familiar brand, yet, that’s focused on Blockchain as a platform vendor. For this space to mature, for us at Wikibon to consider it mature, there needs to be a few of these startups that rise above the pack and survive. An enterprise IT professional needs to know that these companies will be around in a few years.

Also, many of the big, established IT vendors have already stepped in with their own Blockchain products and services. I mean, IBM certainly does. AWS launched their own platform over the past year or so. So has Microsoft, Oracle, and so has VMware.

What I'm getting at is that all of these established IT vendors are starting to test the waters of the Blockchain market with tech solutions and cloud services. None of them has had runaway success in terms of a Blockchain platform or adoption. None has become the de facto standard either. We haven't even gotten to the point where the M&A in this space has picked up.

The hype is very much in advance of the actual maturation and shakeout of the Blockchain space.

Be sure to check out Part 2 of this conversation where we deep dive into the real practical value of the Blockchain and some of the business use cases where it shines.


At the recent DataWorks Summit in San Jose, Paige Roberts, Senior Product Marketing Manager at Syncsort, had a moment to speak with Dr. Sourav Dey, Managing Director at Manifold.

In the first part of our three-part interview Roberts spoke to Dey about his presentation which focused on applying machine learning and data science to real world problems. Dey gave two examples of matching business needs to what the available data could predict.

In part two, Dey discussed augmented intelligence, the power of machine learning and human experts working together to outperform either one alone.

In this final installment Roberts and Dey speak about the importance of data quality and entity resolution in machine learning applications.

Roberts: In your talk, you gave an example where you tried two different machine learning algorithms on a data set, and didn’t get good results either time. Rather than trying yet another, more complicated algorithm, you concluded that the data wasn’t of good quality to make that prediction. What quality aspects of the data affect your ability to use it for what you’re trying to accomplish?

Dey: That’s a deep question. There are a lot of things.

Let’s dive deeper then.

So, at the highest level, there’s the quantity of data. You can’t do very good machine learning with only a handful of examples. Ideally you need thousands of examples. Machine learning is not magic. It’s about finding patterns in historical data. The more data, the more patterns it can find.

People are sometimes disappointed by the fact that if we’re looking for something rare, they may not have very many examples of it. In those situations, machine learning often doesn’t work as well as desired. This is often the case when trying to predict failures. If you have good dependable equipment, failures are often very rare – occurring only in a small fraction of the examples.

There are techniques, like sample rebalancing, that can address certain issues with rare events, but fundamentally, more examples will lead to better performance of the ML algorithm.
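Sample rebalancing in its simplest form just duplicates rare-class examples until the classes are even. The sketch below is illustrative only; the function name is invented, and real projects would more often reach for library support such as class weights or SMOTE-style synthesis.

```python
import random

def oversample_minority(dataset, seed=0):
    """Duplicate rare-class examples until both classes are the same size.

    dataset: list of (features, label) pairs with labels 0/1.
    """
    rng = random.Random(seed)
    pos = [row for row in dataset if row[1] == 1]
    neg = [row for row in dataset if row[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # sample extra copies of the minority class until the counts match
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# 98 healthy readings and 2 failures: the rare-failure skew Dey describes
data = [([i], 0) for i in range(98)] + [([99], 1), ([100], 1)]
balanced = oversample_minority(data)
print(len(balanced))                          # 196
print(sum(1 for _, y in balanced if y == 1))  # 98
```

This keeps a classifier from trivially predicting "no failure" every time, though it adds no genuinely new information, which is why more real examples still beat rebalancing.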

What are other issues to be aware of?

Another aspect, of course, is whether the data is labeled well. Tendu talked about this, too, in her talk on anti-money laundering. Lineage issues are a problem. Things like: oh, actually, the product was changed here, but I never noted it, so all of these features have changed. This comes up a lot, particularly with web and mobile-based products where the product is constantly changing. Often such changes mean that a model can't be trained on data from before the change, because it is no longer a good proxy for the future. Labeling is one of the biggest issues. I gave you the oil and gas example, where they thought they had good labeling, but they didn't.

How about missing data?

Missing data is surprisingly not that big of an issue. In the oil and gas sensor data, readings could drop off for a while because of poor internet connectivity. Small dropouts we could fill in with simple interpolation techniques. Larger dropouts we would just throw out. That's much easier to deal with than labeling issues.
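That policy, interpolate short dropouts and discard long ones, might look like the following on a single sensor series. This is a sketch; the 3-sample gap threshold is an arbitrary choice for illustration, and the function name is invented.

```python
def fill_small_gaps(series, max_gap=3):
    """Linearly interpolate runs of None up to max_gap samples long;
    leave longer dropouts as None so they can be discarded later."""
    out = list(series)
    i = 0
    while i < len(out):
        if out[i] is None:
            start = i
            while i < len(out) and out[i] is None:
                i += 1
            gap = i - start
            # interpolate only interior gaps that are short enough
            if gap <= max_gap and start > 0 and i < len(out):
                lo, hi = out[start - 1], out[i]
                for k in range(gap):
                    out[start + k] = lo + (hi - lo) * (k + 1) / (gap + 1)
        else:
            i += 1
    return out

readings = [10.0, None, None, 16.0, None, None, None, None, 20.0]
print(fill_small_gaps(readings))
# [10.0, 12.0, 14.0, 16.0, None, None, None, None, 20.0]
```

The short two-sample gap is filled linearly, while the four-sample dropout is left as None for the caller to throw out, matching the two-tier handling described above.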

Can you talk a bit about entity resolution and joining data sources?

Yes, this is another problem we often face. The issue is joining data sources, particularly with bigger clients. They'll have three silos, seven silos, ten silos; really big companies sometimes even have 50 or 100 silos of data that have never been joined, but are all about the same user base.

The data are all about the same people.

Right, and even within a single data source, the data needs to be de-duplicated, because the same records appear more than once. I'll give a concrete example. We worked with a company that is an expert search firm. Their business is to help companies find specific people with certain skills, e.g. a semiconductor expert who understands 10 nanometer technology. Given a request, they want to find a relevant expert as fast as possible.

Clean, thick data drives business value for them by giving their search a large surface area to hit against. They can then service more requests, faster. Their problem was that they had several different data silos and they never joined them. They only searched against one. They knew that they were missing out on a lot of potential matches and leaving money on the table. They hired Manifold to help them solve this problem.

How do we join these seven silos, and then figure out if the seven different versions of this person are actually the same person? Or two different people, or five different people.

This problem is called entity resolution. What's interesting is that you can use machine learning to do entity resolution. We've done it a couple of times now. There are some pretty interesting natural language processing techniques you can use, but all of them require a human in the loop to bootstrap the system. The human labels pairs, e.g. these records are the same, these records are not the same. These labels are fed back to the algorithm, and then it generates more examples. This general process is called active learning. It keeps feeding back the pairs it's not sure about to get labeled. With a few thousand labeled examples, it can start doing pretty well at both the de-duplication and the joining.
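The selection step at the heart of that loop can be hinted at with a toy uncertainty-sampling sketch. This is illustrative only: the Jaccard word-overlap score stands in for the real NLP features and trained model, and the function names are invented.

```python
def similarity(a, b):
    """Crude match score between two name strings: Jaccard overlap of words.
    A stand-in for a real trained pair-matching model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def most_uncertain(pairs, k=2):
    """Uncertainty sampling: pick the pairs whose scores sit closest to 0.5,
    i.e. the ones the model is least sure about, for the human to label."""
    return sorted(pairs, key=lambda p: abs(similarity(*p) - 0.5))[:k]

candidate_pairs = [
    ("John A. Smith", "John Smith"),
    ("John A. Smith", "Mary Jones"),
    ("J. Smith", "John A. Smith"),
]
# The human labels only the pairs the model is least sure about; those labels
# would then retrain the model, and the loop repeats.
for pair in most_uncertain(candidate_pairs):
    print(pair)
```

In a real system the retraining step closes the loop: each batch of human labels sharpens the model, which in turn surfaces a new batch of genuinely ambiguous pairs.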

That's a challenge, yeah. One of the tricks is to use a blocking algorithm, which is a crude classifier. Then, after the blocking, you have a much smaller set to do the machine learning-based comparison on. That being said, even the blocking has to be run on N times M records, where N and M are in the millions.
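A blocking step like the one described can be sketched as follows. This is a toy example; the first-three-letters blocking key is a hypothetical choice, and real systems use more robust keys such as phonetic codes.

```python
from collections import defaultdict

def block_key(record):
    """Hypothetical blocking key: first three letters of the last name."""
    return record["last"][:3].lower()

def candidate_pairs(silo_a, silo_b):
    """Compare only records sharing a block key, instead of all N x M pairs."""
    blocks = defaultdict(list)
    for rec in silo_b:
        blocks[block_key(rec)].append(rec)
    for rec in silo_a:
        for other in blocks.get(block_key(rec), []):
            yield rec, other

silo_a = [{"last": "Smithson"}, {"last": "Jones"}]
silo_b = [{"last": "Smith"}, {"last": "Smythe"}, {"last": "Jonas"}]
pairs = list(candidate_pairs(silo_a, silo_b))
print(len(pairs))  # 2 candidate pairs, versus 6 for the full cross product
```

Only the surviving candidate pairs go to the expensive machine learning comparison, and because each block is independent, the whole pass parallelizes naturally, which is the embarrassingly parallel structure mentioned below.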

Where if you have seven silos and there’s a million records each and a hundred attributes per record, it’s a million times a million seven times …

It blows up quickly. That's where you have to be smart about parallelizing, and I think that's where the Syncsort type of solution can be really powerful. It is an embarrassingly parallel problem. You just have to write the software appropriately so that it can be done well.

At the recent DataWorks Summit in San Jose, Paige Roberts, Senior Product Marketing Manager at Syncsort, had a moment to speak with Dr. Sourav Dey, Managing Director at Manifold. In the first part of our three-part interview Roberts spoke to Dr. Dey about his presentation which focused on applying machine learning to real world requirements. Dr. Dey gave two examples of matching business needs to what the available data could predict.

Here in part two, Dr. Dey discusses augmented intelligence, the power of machine learning and human experts working together to outperform either one alone. In particular, AI as triage is a powerful application of this principle, and model explainability is the key to making it more useful.

Roberts: One of the big themes I’m seeing here, what the keynote talked about this morning, is that the best chess machine can beat the best human chess player, but both can be beaten by a mediocre chess player with a really good chess program working together. One of the things you talked about was that kind of cooperation between people and machines speeding up triage, and how that works.

Dey: Yeah, so this is what many people call augmented intelligence. I would say almost 50% or more of the projects that we do at Manifold fall into the business pattern that I call “AI as Triage”. The predictions that the AI makes help to triage a lot of information that a single human can't process. Then, the AI presents it in a way that a human can make a decision on. That's a theme that I've seen over and over again. Both of the examples I gave before fit that, for instance.

In the baby registry example, our client was collecting all of these signals that no single human can understand, all the web clicks, mobile clicks, marketing data, etc. The AI is triaging that and distilling it down so that a marketing person or the product person can make decisions on it.

In the oil and gas company example, it’s the same. The machines are generating fine-tick data from 54 sensors from thousands of locations across the country, no person (or even team of people) can look at that all the time.

Nobody can make sense of that.

Yeah, but the AI can crush it down, and present it to humans in an actionable way. That can really speed up that triage process. So that’s the goal there.

I was impressed by one example you mentioned. You have these decision trees making a decision that something would fail, and that was kind of useful. But the person still had to figure out from scratch why it would fail, and how to repair it. Whereas, if the AI explained … how was that done?

The TreeSHAP algorithm, yeah. It explains how a decision tree came to a particular decision. It’s relatively recent that people are doing some good research into this. Essentially, there is the model that’s making the prediction. Then, you can make another model of that model that explains the original model. It tells you why it made that prediction.

That WHY can be key.

There have been a few competing techniques out there. All of them had some issues, but this group at the University of Washington, inspired by game theory from economics, made a consistent explanation. It's called the Shapley metric. What's nice is that they developed a fast version of it that can be used with tree-based models, called TreeSHAP. It's fantastic. We use it all the time now for explanations of why the model is making a particular individual prediction. For instance: today, you predicted .91 probability of failure. Why? You could also use it at the aggregate level, for something like: on the whole, across thousands of machines over five years, what was the importance of this feature in making the prediction?
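The Shapley idea itself, averaging a feature's marginal contribution over every order in which features can be added, can be computed exactly for a tiny model. The sketch below uses an invented two-feature risk model purely for illustration; TreeSHAP's contribution is doing this computation efficiently for real tree ensembles rather than by brute-force enumeration.

```python
from itertools import permutations

def shapley_values(features, value_fn):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which the features can be added."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(present)
            present.add(f)
            totals[f] += value_fn(present) - before
    return {f: t / len(orderings) for f, t in totals.items()}

def risk(present):
    """Invented failure-risk model: high temperature and high vibration
    each add risk, and together they interact."""
    score = 0.1
    if "temp_high" in present:
        score += 0.4
    if "vib_high" in present:
        score += 0.2
    if {"temp_high", "vib_high"} <= present:
        score += 0.2  # interaction term
    return score

phi = shapley_values(["temp_high", "vib_high"], risk)
print({f: round(v, 6) for f, v in phi.items()})
# {'temp_high': 0.5, 'vib_high': 0.3}
```

Note the consistency property the transcript alludes to: the attributions sum exactly to the gap between the full prediction (0.9) and the baseline (0.1), with the interaction term split fairly between the two features.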

And then the person going to repair that equipment knows WHY it was predicted to fail, and therefore has a pretty good idea of what they have to fix.

Well at least they have a much better idea. The maintenance engineers have a web app that they can then dig deeper into looking at the historical time series. In addition, they can VPN into the physical machine. All in all, the explainable model allows them to do triage much faster, and, in turn, do the repair more quickly.

Model explainability is incredibly useful for a lot of things. I know Syncsort has been doing a lot of work around GDPR, and I talked to a data scientist in Germany, Katharine Jarmul about this. For example, if a person wants a loan, and you’ve got a machine learning model that says no, you can’t have that loan, you have to be able to explain why.

Totally, yeah. There are laws about that for important civil rights reasons.

For what you’re doing, the reasons are less legal and more practical. If I’m going to use this prediction in order to take an action, such as a repair, it helps a lot if I know how the prediction was reached.

I can give another example of that. We did work for a digital therapeutics company. They make an app, along with wearables, that helps people get their diabetes under control. We were making predictions of whether, in 24 weeks, the patient's blood sugar would go below a certain level. There's a human in the loop, a human coach that you get as part of this program. They didn't know what to do with the raw prediction probability. When we put in an explainable algorithm that let them know why that number was high or low, they could have much better phone calls with the patients.

Because they knew WHY the blood sugar was likely to dip.

They could say things like, hey, I see that you’re not doing this food planning very much, or you haven’t logged into the app in a while. You used to log in seven times a week. What’s going on? They have the knowledge ready to have a high bandwidth interaction with the patient.

So, I think there’s a lot there.

The more I learn about model explainability, the more I see where it’s hugely useful.

There are a lot of folks doing cool things with deep learning. It’s far harder to explain, but there’s work being done on that. Hopefully, in the next few years, there will be better techniques to explain those more complex models as well.

Tune in for the final part of this interview where Roberts and Dey speak about the effect of data quality as well as Entity Resolution in conjunction with machine learning.

At the recent DataWorks Summit in San Jose, Paige Roberts, Senior Product Marketing Manager at Syncsort, had a moment to speak with Dr. Sourav Dey, Managing Director at Manifold. In the first of this three part interview Roberts spoke to Dr. Dey about his presentation which focused on applying machine learning to real world requirements. Dr. Dey gives two examples of matching business needs to what the available data can predict.

Roberts: So, let’s get started! Can you introduce yourself and where you work, and what you do?

Dey: My name is Sourav Dey. I'm a Managing Director at Manifold. We're an AI engineering services firm that focuses on accelerating AI projects for high growth and Fortune 500 companies. I did my Ph.D. in Computer Science at MIT, and I've been doing algorithms engineering for the last 10 years, both real-time algorithms on embedded processors and big data algorithms in the cloud. Manifold does both, but most of our work has been on the latter. We build custom AI solutions for our clients.

You were a data scientist before data scientists were cool!

Yes, yes, before it was the thing. (both laughing) But I like the term. Yeah, I’m a data scientist. Though that seems to be losing vogue, and “machine learning engineer” is the new hotness. I’m fine with that, too.

You’re still doing the same thing, whatever folks want to call it this week. So, I know you just did your talk. You gave some interesting case studies, some great machine learning implementation examples. Would you like to talk about a couple of the examples from that?

Yeah. I gave two examples in my talk. One was about work we did for one of the leading baby registries in the US. We helped them go from unstructured desires to a clear engineering spec that could be built.

You condensed business need down to something you could build?

Exactly. I think one of the key takeaways from the talk was that you have to learn about the business needs by getting in the customer's shoes, as well as understand their data. It's the marriage of the two where you can come up with the spec that an engineer can build to. Doing that is much of the challenge. For example, by doing a deep dive into their business, we learned that this baby registry company wanted to make decisions faster. Because of the nature of their business, it would take nine months for any marketing or product experiment to have the final measured output that they could then make a decision on. That is far too long in the age of the Internet.

So we made them a set of predictive models that would predict what is likely to happen nine months later, based on data from after a day, after 2 days, after 7 days, after 30 days, etc. We've deployed these machine learning models to production, and now they're able to make decisions much more quickly, because the models are very accurate. They are able to make decisions on marketing campaigns and product changes much more rapidly by doing AB tests and looking at the model output. Before, they were using heuristics to make their decisions. Now they can make more data-driven decisions, more rapidly.

Okay. So capturing the business need was a big piece of that. No matter how good your machine learning is, or your AI predictions are, at the end of the day, if they don’t meet the business need, and if the business need isn’t big enough to give a good ROI, it’s not worth bothering with the project.

Totally.

So that was the first big point I got from your presentation. The second one was matching the business need up with the data. It might be extremely important and very valuable to the business to be able to make a prediction, but if they don’t have the data to support that prediction, you’ve got a problem. Can you talk about that one a little?

Yeah, the baby registry example was a good, positive outcome. The second example was a much more challenging data problem. Our client was an oil and gas company that wanted to make their maintenance operations more efficient. The dream was, “If I can predict when these machines are going to fail, I can turn unplanned maintenance into planned maintenance”. They had two major data sets. One was a sensor data set, one-minute tick samples of many different sensors coming off of the machines into their cloud. The other major data set was human-entered maintenance logs in their workflow software. It documented what parts they replaced and how long they worked, and had a lot of freeform notes.

Early on, during our data audit phase, we found that a lot of that human-generated data was very untrustworthy. It just wasn't captured in a way that we could get good value out of it. There were five to seven different types of failures that they were particularly interested in. It turns out these failures were not labeled well in the maintenance logs because the root cause was not documented. Also, the way it was captured changed over the five years of history that they had. There was no good way to label the historical failures and see that a specific thing failed at a specific time for a specific reason.

They didn’t record the root cause. They just recorded that they replaced a specific part at a specific time.

Correct, and you could replace those parts for various reasons. An expert, maybe, could go back and figure out what happened there, but retroactively labelling the data would be costly and slow.

We ended up having them capture the root cause analysis going forward, improving the data with clean labels. But that wasn't in the historical data they had already captured. That's why we were unable to deliver the dream as they originally envisioned it.

But, what we were able to do is predict a different class of failures using purely the sensor data. That data is much more trustworthy without the data lineage issues of the maintenance logs. We ended up focusing on major faults where the machine went off line for over two hours. We were able to create a successful predictive model using the historical sensor data to predict these faults. This was useful to their maintenance operations, but at the same time, many faults that are not as interesting to the business are caught with this definition of failure. It’s always a trade-off. This was the best we could do with the data that we had.

Come back for part two where Roberts and Dey discuss augmented intelligence, AI as Triage, and the importance of model explainability.

2018 has been a big year for expert interviews so far. We’ve spoken with professionals in Big Data, Machine Learning, Artificial Intelligence, the Cloud, and more! Here are just a few of the most popular experts that we’ve spoken to this year.

As the co-founder and Chief Strategy Officer of Cloudera, Mike Olson (@mikeolson) had just given a presentation on Machine Learning and Cloud Adoption in organizations today.

Shortly after, Paige Roberts (@RobertsPaige) of Syncsort spoke with him and got this great three-part interview which includes topics such as Machine Learning, Cloud, the Gartner Hype Cycle, and Women in Technology.

Follow along for the whole series!

Tony Baer (@TonyBaer), the Principal Analyst at Ovum, focuses on Big Data and why it should be at the center of attention in businesses around the world.

Baer spoke to us about trends in Big Data, including what he believes that the future of Hadoop and Cloud data will hold.

Read about it more below.

Data Scientist and founder of KDnuggets News, Gregory Piatetsky-Shapiro, has published over 60 works on data mining and knowledge discovery, which have earned over 10,000 citations.

In this two-part interview, Shapiro speaks to us about the advances in Artificial Intelligence and the approach that businesses should take dealing with it.

Check out both parts in the links!

Tobi Bosede (@AniTobiB) is a Senior Machine Learning Engineer and had just given a presentation when Syncsort’s Paige Roberts (@RobertsPaige) had a moment to speak with her.

Bosede covered a wide range of topics from having a career in Machine Learning to what it’s like as a double minority in the technical field.

Read all 3 parts to find out the details!

We round out the first half of the year with James Kobielus (@jameskobielus) who is the SiliconANGLE Wikibon Lead Analyst for Data Science, Deep Learning, and Application Development.

In a two part interview, Kobielus explains how Machine Learning and Artificial Intelligence are impacting the world today and why there is no need to fear.

Discover more in the links below!

Don’t forget to visit our blog for even more great expert interviews and the latest in everything else.

Christopher Tozzi

Recently, while Elise was working with NPR, they discussed the fact that episodes of NPR programs posted online did not provide captions. While these shows generally have an associated article or a transcript of the conversation, Elise pointed out that NPR might be filtering out a significant portion of the population: people with hearing loss who can still appreciate an audio-centered show, and people who are completely deaf but like the pacing that captions bring and a less cluttered visual experience.

Because of their conversation, NPR has a better understanding of an entire market they might be missing out on.

Her way of problem-solving is catching on.

“A couple of years ago, when I was telling people about human centered design, they had no idea what I was talking about,” Elise says. “But now they’re starting to recognize the value it provides businesses and starting to see how they can create more targeted, responsive solutions.”

Big Data plays an important role in creating more customer-centric solutions. It allows organizations to better understand how to respond to the human experience, build more personalized and customized experiences, and identify patterns that might otherwise be difficult to see.

Currently, one of the biggest struggles with integrating the perspective of people with disabilities is that there is such a wide variety of disabilities that it can be challenging to design with each one in mind.

Elise says Big Data can help overcome those challenges.

There are already products on the market that benefit individuals with disabilities that use the power of Big Data and the Internet of Things.

For instance, some companies are developing doorbell home security solutions that alert users to motion and allow them to monitor the door remotely, an ideal solution for individuals with mobility problems. Innovations like this, along with others such as the Roomba and self-driving cars, not only make it easier for people with disabilities to live independently but are also products that the general population enjoys.

In order to continue to bring innovations like these to market, it will be essential that Big Data be paired with human centered design methods.

“This is because big data can easily be influenced by bias,” Elise says. “For example, we could only collect certain kinds of data and be missing out on a key thing that would get uncovered through the human centered design process during the observation phase.”

Recently, Microsoft hired several experts in bias reduction in Artificial Intelligence after recognizing that its AI applications were biased: they had been designed around the beliefs of the people building them rather than the people who would actually use them.

Moving forward, Elise believes there needs to be symbiosis between Big Data and the human aspect of design.

Elise’s consulting business is still in its infancy, but she’s excited about the potential impact that looking at innovation through the lens of people with disabilities offers businesses.

“There’s a lot of people who have gotten back to me and said it’s really impacted how they’re thinking about things,” Elise says.