Category: data quality

Founded in 1993, Trillium Software has for some years been the largest independent data quality vendor, nestled since the late 1990s as a subsidiary of US marketing services company Harte Hanks. The latter was originally a newspaper company dating back to 1928, but switched to direct marketing in the late 1990s, and had overall revenues of $495 million in 2015. There is a clear link between data quality and direct marketing, since name and address validation is an important part of marketing campaigns. However, the business model of a software company is different from that of a marketing firm, so there was always going to be a certain awkwardness in Trillium living under the Harte Hanks umbrella.

On June 7th 2016 the parent company announced that it had hired an advisor to look at “strategic alternatives” for Trillium, including the possibility of selling the company, though the announcement made clear that a sale was not a certainty. Trillium has around 200 employees and a large existing customer base, so it will have a steady income stream from maintenance revenues. The data quality industry is not the fastest growing sector of enterprise software, but it is well established and quite fragmented. As well as offerings from Informatica, IBM, SAP and Oracle (all of them based on acquisitions), there are dozens of smaller data quality vendors, many of which grew up around the name and address matching problem, which lends itself well to at least partial automation. While some vendors such as Experian have traditionally focused on this problem, others such as Trillium have developed much broader data quality offerings, with functions such as data profiling, cleansing, merge/matching, enrichment and even data governance.

There is a close relationship between data quality and the somewhat faster growing sector of master data management (MDM), so MDM vendors might in principle seem natural acquirers of data quality vendors. However, MDM itself has consolidated somewhat in recent years, and the big players in it such as Informatica, Oracle and IBM all market platforms that combine data integration, MDM and data quality (though in practice the degree of true integration is distinctly more variable than it appears on PowerPoint). Trillium might be too big a company to be swallowed up by the relatively small independents that remain in the MDM space. It will be interesting to see what emerges from this exercise. It certainly makes sense for Trillium to stand on its own two feet rather than living within a marketing company, but on the other hand Harte Hanks may have missed the boat. A few years ago large vendors were clamouring to acquire MDM and related technologies, but now most companies that need a data quality offering have either built or bought one. The financial adviser in charge of the review may have to be somewhat creative about whom it considers as a possible acquirer.

At a conference in Lausanne in June 2014, SAS shared their current business performance and strategy. The privately held company (with just two individual shareholders) had revenues of just over $3 billion, with 5% growth. Their subscription-only license model has meant that SAS has been profitable and growing for 38 years in a row. Revenue splits 47% from the Americas, 41% from Europe and 12% from Asia Pacific. They sell to a broad range of industries, but the largest in terms of revenue are banking at 25% and government at 14%. SAS is an unusually software-oriented company, with just 15% of revenue coming from services. Last year SAS was voted the second best company globally to work for (behind Google), and attrition is an unusually low 3.5%.

In terms of growth, fraud and security intelligence was the fastest growing area, followed by supply chain, business intelligence/visualisation and cloud-based software. Data management software revenue grew at just 7%, one of the lowest rates in the product portfolio. Cloud deployment is still relatively small compared to on-premise but growing rapidly, expected to exceed $100 million in revenue this year.

SAS has a large number of products (over 250), but gave some general information on broad product direction. Its LASR product, introduced last year, provides in-memory analytics. They do not use an in-memory database, as they do not want to be bound to SQL. One customer example given was a retailer with 2,500 stores and 100,000 SKUs that needed to decide what merchandise to stock its stores with, and how to price locally. It used to run this analysis in an eight-hour window at an aggregate level, but can now do it in one hour at individual store level, allowing more targeted store planning. The source data can come from traditional sources or from Hadoop. SAS has been working with a university to improve the user interface, starting from the UI and designing the software around it, rather than producing a product and then adding a user interface as an afterthought.

In Hadoop, there are multiple initiatives from both major and minor suppliers to apply assorted versions of SQL to Hadoop. This is driven by the mass of SQL skills in the market compared to the relatively tiny number of people who can fluently program using MapReduce. Workload management remains a major challenge in the Hadoop environment, so a lot of activity has gone into integrating the SAS environment with Hadoop. Connection is possible via HiveQL. Moreover, SAS processing is being pushed down to Hadoop via MapReduce rather than extracting data. A SAS engine is placed on each cluster to achieve this. This includes data quality routines such as address validation, applied directly to Hadoop data with no need to export it from Hadoop. A demo was shown using the SAS Studio product to take some JSON files, do some cleansing, and then use Visual Analytics and In-Memory Statistics to analyse a block of 60,000 Yelp recommendations, blending this with another recommendation data set.
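The push-down idea mentioned above, shipping the routine to the data rather than the data to the routine, is easiest to see through the MapReduce pattern itself. Here is a toy sketch in plain Python (not SAS's actual engine, and the record format is an assumption): a mapper/reducer pair that normalises addresses and flags duplicate clusters.

```python
from collections import defaultdict

# Toy illustration of the MapReduce pattern behind push-down cleansing:
# the mapper normalises each address record where it lives; the reducer
# groups by the normalised key and reports candidate duplicates.
def mapper(record):
    # Collapse case and whitespace to form a matching key.
    key = " ".join(record["address"].upper().split())
    yield key, record["id"]

def reducer(key, ids):
    if len(ids) > 1:
        yield key, sorted(ids)  # candidate duplicate cluster

def run(records):
    groups = defaultdict(list)          # stands in for the "shuffle" phase
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return [out for key, ids in groups.items() for out in reducer(key, ids)]

data = [
    {"id": 1, "address": "10 Downing St"},
    {"id": 2, "address": "10  downing st "},
    {"id": 3, "address": "221B Baker St"},
]
print(run(data))  # -> [('10 DOWNING ST', [1, 2])]
```

In a real Hadoop deployment the mapper and reducer would run on the cluster nodes holding the data blocks, which is precisely why no export step is needed.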

A further element of consolidation in the data management market occurred when Oracle purchased Datanomic, a data quality company based in Cambridge (for a change, the original one in England rather than the one near Boston). Datanomic has been an interesting story: set up in 2001, it brought to market a well-rounded data quality product. This is a crowded market, and in the dreadful conditions for enterprise software that followed the market crash in 2001 the company initially struggled. There were, after all, an awful lot of data quality products out there that people had already heard of. Then Datanomic did a very smart thing and repositioned itself to focus on a business rather than a technical issue: compliance, especially in financial services.

This turned out to be an inspired change of marketing strategy, and the company went from layoffs to hiring again, growing rapidly over the last three years, far in excess of the 9% annual growth seen recently in the general data quality market. Datanomic has had positive customer references in our regular annual surveys, and it seems to me a well-architected solution. From Oracle’s point of view, this complements its purchase of Silver Creek, a specialist product data quality tool. These two acquisitions suggest that Oracle is changing its view of data quality – previously it relied on partner arrangements with companies such as Trillium for its data quality solution. Now it would appear that it sees data quality as a more integral issue. The price of the deal was not disclosed, but given Datanomic’s rapid recent growth, it will doubtless have been at a healthy premium.

This week I will be delivering the keynote speech at the IDQ Data Governance Conference in San Diego (funny how they never hold technology conferences in Detroit or Duluth). This promises to be an excellent event, with over 350 registered attendees, and plenty of movers and shakers in this emerging field. Data governance is the business-led strand that is beginning to bring together the hitherto curiously separate worlds of MDM and data quality, and it will be interesting to see what leading end-user companies are doing in this field.

Get a discount to the upcoming data governance conference in San Diego.

In early June there is the annual Data Governance Conference:

http://www.debtechint.com/dg2010/

which this year is in the attractive setting of San Diego (the place with perhaps the best climate in the USA). Naturally as a conference delegate you will be influenced solely by the agenda and the speaker quality rather than the prospect of a sunny location, but I just thought I’d mention it.

There will be some excellent speakers, and also me giving the keynote. As a reader of this blog, you can claim a discount should you be able to attend. Just quote the following code when booking: IDDG100 – please be aware that this code expires on May 7th.

I read a very interesting article today by independent data architecture consultant Mike Lapenna about ETL logic. Data governance initiatives, MDM and data quality projects all need business rules of one kind or another. Some of these may be trivial, and as much technical as business, e.g. “this field must be an integer of at most five digits, and always less than the value 65000”. Others may be more clearly business-oriented, e.g. “customers of type A have a credit rating of at most USD 2,000” or “every product must be part of a unique product class”. Certainly MDM technologies provide repositories where such business rules may be stored, as (with a different emphasis) do many data quality repositories. Some basic information is stored within the database system catalogs, e.g. field lengths and primary key information. Databases and repositories are generally fairly accessible, for example via a SQL interface or some form of graphical view. Data modeling tools also capture some of this metadata.

Yet there is a considerable source of rules that is obscured from view. Some are tied up within business applications, while another class is equally opaque: those locked up within extract/transform/load (ETL) rules, usually in the form of procedural scripts. If several source files need to be merged, for example to load into a data warehouse, then the logic which defines what transformations occur constitutes important rules in their own right. Certainly they are subject to change, since source systems sometimes undergo format changes, for example if a commercial package is upgraded. Yet these rules are usually embedded within procedural code, or at best within the metadata repository of a commercial ETL tool. Mike’s article proposes a repository that would keep track of the applications, data elements and interfaces involved, the idea being to hold the rules as (readable) data rather than buried away in code.
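The rules-as-data idea can be illustrated with a small sketch. The rule table below is hypothetical (field names, rule ids and thresholds are invented for illustration, and the "type A credit rating" rule is simplified to a flat cap), but it shows the key property: because rules are held as data with descriptions, they can be queried, listed and reported on, rather than rediscovered by reading ETL scripts.

```python
# Hedged sketch: validation/transformation rules held as data rather than
# hard-coded in procedural ETL logic. All names here are illustrative.
RULES = [
    # (rule id, field, human-readable description, predicate)
    ("R1", "quantity", "integer of at most five digits, always below 65000",
     lambda v: isinstance(v, int) and 0 <= v < 65000),
    ("R2", "credit_limit", "type-A customers capped at USD 2,000 (simplified)",
     lambda v: v <= 2000),
]

def check(record):
    """Return the ids of rules that the record violates."""
    failures = []
    for rule_id, field, _desc, predicate in RULES:
        if field in record and not predicate(record[field]):
            failures.append(rule_id)
    return failures

print(check({"quantity": 70000, "credit_limit": 1500}))  # -> ['R1']
```

Because the descriptions travel with the predicates, a data steward can list every rule affecting a field without touching code, which is exactly what embedded ETL scripts make impossible.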

The article raises an important issue: rules of all kinds concerning data should ideally be held as data and so be accessible, yet ETL rules in particular tend not to be. It is beyond the scope of the article, but for me there is a question of how the various sources of business rules – ETL repository, MDM repository, data quality repository, database catalogs etc. – can be linked together so that a complete picture of the business rules can be seen. Those with long memories will recall old-fashioned data dictionaries, which tried to perform this role but mostly died out, since they were always essentially passive copies of the rules in other systems and so easily became out of date. Yet the current trend towards managing master data actively raises questions about just what the scope of data rules should be, and where they should be stored. Application vendors, MDM vendors, data quality vendors, ETL vendors and database vendors each have their own perspective, and will inevitably each seek to control as much of the metadata landscape as they can, since ownership of this level of data will be a powerful position to be in.

From an end user perspective what you really want is for all such rules to be stored as data, and for some mechanism to access the various repositories and formats in a seamless way, so that a complete perspective of enterprise data becomes possible. This desire may not necessarily be shared by all vendors, for whom control of business metadata is power. An opportunity for someone?

We have now completed our survey of data quality. Based on 193 responses from IT and business staff around the world, there were some very interesting findings. Amongst these was that 81% of respondents felt that data quality was much more than just customer name and address, which is the focus of most of the vendors in the market. Moreover, customer name and address data ranked only third in the list of data domains that survey respondents found most important. Both product and financial data were felt to be more important, yet product data is the focus of barely a handful of vendors (Silver Creek, Inquera, Datactics), while of all the dozens of data quality vendors out there, few indeed focus on financial data. Name and address is of course a common issue, and conveniently it is well structured, with plenty of well-established algorithms to attack it. Yet surely the vendor community is missing something when customers rate other data types as more important?

Another recurring theme is the lack of attention given to measuring the costs of poor data quality. Many respondents make no effort to measure this at all, and then complain that it is hard to make a business case for data quality. “Well, duh”, as Homer Simpson might say. Estimates given by survey respondents seemed very low compared both to our experience and to anecdotes given in the very same survey. One striking example: “Poor data quality and consistency has led to the orphaning of $32 million in stock just sitting in the warehouse that can’t be sold since it’s lost in the system.” This company at least has no difficulty in justifying a data quality initiative. The survey had plenty of other interesting insights too.

The full survey and analysis, all 33 pages of it, can be purchased from here.

Informatica buys Address Doctor, sowing uncertainty among those who license its data.

Most data quality vendors have their roots in name and address checking, even if their software can go beyond this. What is less well known is that actually obtaining street-level address data (to verify postal codes etc.) is a tedious business that varies dramatically by country (the UK post office database covers almost every address in the UK, but Eire has no postcode system, for example). Software vendors do not typically want to be in the business of updating street address databases, and there is a patchwork of local information providers that fill the gaps. If you have any international aspirations, though, just discovering who does what by country and licensing the various data sources is in itself a non-trivial task, and so companies exist that do exactly this. One was a UK company called Global Address, bought some time ago by Harte Hanks (who market Trillium); the other was Address Doctor. Many data quality vendors use Address Doctor, including some that might superficially appear to compete, such as Dataflux, IBM, and even QAS. Some MDM platform vendors also use Address Doctor, which provides at least basic name and address data for 240 countries and territories.
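To see why this varies so much by country, consider postcode formats alone. A minimal sketch follows; the patterns are deliberately simplified illustrations, not a production validator (real UK validation needs the full Royal Mail address file, and a format match says nothing about whether an address actually exists).

```python
import re

# Illustrative, simplified postcode format patterns by country code.
# Real validation requires each country's official address data.
POSTCODE_PATTERNS = {
    "GB": r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}",  # e.g. "SW1A 1AA"
    "US": r"\d{5}(-\d{4})?",                     # ZIP or ZIP+4
    "NL": r"\d{4} ?[A-Z]{2}",                    # e.g. "1012 AB"
}

def postcode_plausible(country, code):
    """Cheap format check only; returns None where no rule exists
    (e.g. Eire, which has no postcode system)."""
    pattern = POSTCODE_PATTERNS.get(country)
    if pattern is None:
        return None
    return re.fullmatch(pattern, code.strip().upper()) is not None

print(postcode_plausible("GB", "sw1a 1aa"))  # True
print(postcode_plausible("US", "90210"))     # True
```

Multiply this by 240 countries and territories, each with its own reference data provider and licensing terms, and the appeal of a neutral aggregator like Address Doctor becomes obvious.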

The cat was put firmly amongst the pigeons this week when Informatica bought Address Doctor. From Informatica’s viewpoint this secures a key provider of address data, and follows its prior acquisitions of Similarity Systems and, more recently, Identity Systems. Via these purchases, Informatica has established itself as one of the major data quality vendors. Given its competitive position, the data quality vendors who use Address Doctor will, at the least, be feeling nervous. I spoke to an executive from Informatica this week and was told that Informatica intended to honour the existing arrangements, but who knows how long this state of affairs will last? As Woody Allen said, the lion may lie down with the lamb, but the lamb won’t get much sleep.

The problem for the other vendors is that there is no obvious place to go. Global Address is already in the hands of Harte Hanks, and while Uniserv in particular has its own name and address data, it is mainly strong in Europe. Address Doctor was a convenient neutral player and is now in the hands of a major market competitor, so other vendors may have little choice but to build up their own networks of address data providers if they are to sleep easy. Of course it is not clear that they need to worry; for example, Pitney Bowes Business Insight (home of what was Group 1 Software) uses Global Address, and that arrangement has continued without incident despite Global Address being owned by Harte Hanks, Trillium’s parent.

It will be interesting to see what measures the current Address Doctor users take, or whether they will just cross their fingers and hope Informatica plays nice.

I have recently been spending some time looking at the data quality market, and a few things pop up time and again. The first, in talking with customers, is just how awful the quality of data really is within corporate systems. One major UK bank found 8,000 customers whose age, according to its systems, was over 150. All seemingly academic (if you are taking money out of your account, who cares what your age is?) until some bright spark in marketing decided that selling life insurance to these customers would be a fine idea.
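The 150-year-old customer problem is exactly the kind of thing a basic profiling rule catches. A minimal sketch, where the field names and the 120-year plausibility cutoff are my own illustrative assumptions:

```python
from datetime import date

# Hedged sketch of a simple data profiling check: flag implausible ages.
# The cutoff and record layout are assumptions for illustration only.
MAX_PLAUSIBLE_AGE = 120

def implausible_ages(customers, today=None):
    """Return the ids of customers whose recorded birth date implies
    an impossible age (negative, or beyond the plausibility cutoff)."""
    today = today or date.today()
    bad = []
    for cust in customers:
        age = (today - cust["date_of_birth"]).days // 365
        if age > MAX_PLAUSIBLE_AGE or age < 0:
            bad.append(cust["id"])
    return bad

records = [
    {"id": 1, "date_of_birth": date(1850, 1, 1)},   # the 150-year-old customer
    {"id": 2, "date_of_birth": date(1980, 6, 15)},
]
print(implausible_ages(records, today=date(2009, 6, 1)))  # -> [1]
```

The check is trivial to write; the striking thing is how rarely anyone runs even this much against production systems.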

Story after story confirms the really shocking data errors that lurk beneath most operational systems. These are the same operational systems used to generate data for the year-end accounts that senior executives happily sign off, these days on pain of jail time. I hope no one shows these same execs the data inside some of these systems, or they might start to get very nervous indeed.

Yet in a survey we did last year, only about a third of companies in the survey have invested in data quality tools at all! Does anyone else find this in any way scary? Do you have any entertaining data quality stories you can share?

Andy Hayler

Andy Hayler is a passionate and outspoken commentator on the enterprise software market. A 20-year veteran of data modelling, warehousing and integration projects, he was named a Red Herring Top 10 Innovator in 2002 for founding Kalido – an innovative information management company that provides customers with the ability to dynamically view the impact of business changes. The views expressed on this blog are Andy’s own, and do not necessarily reflect the views of The Information Difference.