Category: Metadata

I read a very interesting article today by independent data architecture consultant Mike Lapenna about ETL logic. Data governance initiatives, MDM and data quality projects all need business rules of one kind or another. Some of these may be trivial, and as much technical as business, e.g. “this field must be an integer of at most five digits, and always less than the value 65000”. Others may be more clearly business-oriented, e.g. “customers of type A have a credit rating of at most USD 2,000” or “every product must be part of a unique product class”. Certainly MDM technologies provide repositories where such business rules may be stored, as (with a different emphasis) do many data quality repositories. Some basic information is stored within database system catalogs, e.g. field lengths and primary key information. Databases and repositories are generally fairly accessible, for example via a SQL interface or some form of graphical view. Data modeling tools also capture some of this metadata.

Yet there is a considerable source of rules that are obscured from view. Some are tied up within business applications, while there is another class that is also opaque: those locked up within extract/transform/load (ETL) rules, usually in the form of procedural scripts. If several source files need to be merged, for example to load into a data warehouse, then the logic that defines which transformations occur constitutes important rules in their own right. Certainly they are subject to change, since source systems sometimes undergo format changes, for example if a commercial package is upgraded. Yet these rules are usually embedded within procedural code, or at best within the metadata repository of a commercial ETL tool. Mike’s article proposes a repository that would keep track of the applications, data elements and interfaces involved, the idea being to hold the rules as (readable) data rather than leaving them buried away in code.
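To make the idea of rules held as data a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the rule records, the field names (quantity, credit_rating_usd, customer_type) and the check names are hypothetical, and none of it is taken from Mike’s article or from any particular ETL or MDM product. The point is only that the rules sit in a plain list that could equally be rows in a repository table, while the code that applies them is generic.

```python
# Hypothetical rule records. In a real system these would be rows in a
# rules repository, not literals in a script.
RULES = [
    {"id": "R1", "field": "quantity", "check": "is_int"},
    {"id": "R2", "field": "quantity", "check": "max_digits", "arg": 5},
    {"id": "R3", "field": "quantity", "check": "less_than", "arg": 65000},
    # Business-oriented rule: only applies when customer_type is "A".
    {"id": "R4", "field": "credit_rating_usd", "check": "at_most",
     "arg": 2000, "when": {"customer_type": "A"}},
]

# The only procedural code: a small, fixed vocabulary of generic checks.
CHECKS = {
    "is_int": lambda v, _: isinstance(v, int) and not isinstance(v, bool),
    "max_digits": lambda v, n: len(str(abs(v))) <= n,
    "less_than": lambda v, n: v < n,
    "at_most": lambda v, n: v <= n,
}


def violations(record, rules=RULES):
    """Return the ids of the rules that the record breaks."""
    failed = []
    for rule in rules:
        # Skip rules whose precondition does not match this record.
        condition = rule.get("when", {})
        if any(record.get(k) != v for k, v in condition.items()):
            continue
        # Skip rules about fields the record does not carry.
        if rule["field"] not in record:
            continue
        value = record[rule["field"]]
        try:
            ok = CHECKS[rule["check"]](value, rule.get("arg"))
        except TypeError:
            # A value of the wrong type cannot satisfy a numeric check.
            ok = False
        if not ok:
            failed.append(rule["id"])
    return failed


if __name__ == "__main__":
    print(violations({"quantity": 70000, "customer_type": "A",
                      "credit_rating_usd": 1500}))
```

Because the rules are data, they can be listed, queried and changed without touching the checking code, which is the property the proposed repository is after.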

The article raises an important issue: rules of all kinds concerning data should ideally be held as data and so be accessible, yet ETL rules in particular tend not to be. It is beyond the scope of the article, but for me there is a question of how the various sources of business rules (ETL repository, MDM repository, data quality repository, database catalogs etc.) can be linked together so that a complete picture of the business rules can be seen. Those with long memories will recall old-fashioned data dictionaries, which tried to perform this role, but which mostly died out since they were always essentially passive copies of the rules in other systems, and so easily became out of date. Yet the current trend towards managing master data actively raises questions about just what the scope of data rules should be, and where they should be stored. Application vendors, MDM vendors, data quality vendors, ETL vendors and database vendors will each have their own perspective, and will inevitably each seek to control as much of the metadata landscape as they can, since ownership of this level of data will be a powerful position to be in.

From an end user perspective what you really want is for all such rules to be stored as data, and for some mechanism to access the various repositories and formats in a seamless way, so that a complete perspective of enterprise data becomes possible. This desire may not necessarily be shared by all vendors, for whom control of business metadata is power. An opportunity for someone?

I’m not sure who had the idea of holding a data quality conference on Halloween, but it was either a lucky coincidence or a truly inspired piece of scheduling. DAMA ran today in London, and continues tomorrow. This also fits with the seasonal festival, which originally was a Celtic festival over two days when the real world and that of ghosts overlapped temporarily. Later the Christian church tried to claim it as their own by calling November 1st All Hallows Day, with 31st October being All Hallows Eve, which in the fullness of time became Halloween. I will resist the temptation to point out the deterioration in data quality over time that this name change illustrates. The conference is held at the modern Victoria Park Plaza hotel, which is that rare thing in London: a venue that seems vaguely aware of technology. It is rumoured that there is even wireless access here, but perhaps that is just the ghosts whispering.

The usual otherworldly spirits were out on this day: the ghouls of the conference circuit were presenting (such as me), while scary topics like data architecture, metadata repositories and data models had their outings. The master data management monster seemed to be making a bid to take over the conference, with assorted data quality and other vendors who had never heard the phrase a year ago confidently asserting their MDM credentials. You’d have to be a zombie to fall for this, surely? In at least one pitch I heard a truly contorted segue from classic data quality issues into MDM, with a hastily added slide basically saying “and all this name and address matching stuff is really what MDM is about really, anyway”. Come on guys, if you are going to try to pep up your data profiling tool with an MDM spin, at least try and do a little research. One vendor gave a convincing-looking slide about a new real-time data quality tool which I know for a fact has no live customers, but then such horrors are nothing new in the software industry.

The conference itself was quite well attended, with about 170 proper customers, plus the usual hangers-on. Several of the speaker sessions over the conference do feature genuine experts in their field, so it seems the conference organisers have managed to minimise the witches’ brew of barely disguised sales pitches by software sales VPs masquerading as independent “experts” that all too often pack conference agendas these days.

Just as it seems that the allure of ghosts is undiminished even in our modern age, so the problems around the age-old issue of data quality seem as spritely (sorry, I couldn’t resist that one) as ever. New technologies appear, but the data quality in large corporations seems to be largely impervious to technical solutions. It is a curious thing but given that data quality problems are very, very real, why can no one seem to make any real money in this market? Trillium is the market leader, and although it is no longer entirely clear what their revenues are, about USD 50M is what I see in my crystal ball. Other independent data quality vendors now swallowed by larger players had revenues in the sub USD 10M range when they were bought (Dataflux, Similarity Systems, Vality). First Logic was bigger at around USD 50M but the company went for a song (the USD 65M price tag gives a revenue multiple no-one will be celebrating). Perhaps the newer generation of data quality vendors will have more luck. Certainly the business problem is as monstrous as ever.

Kalido has now announced revised positioning targeted at selling solutions to business problems (and will soon announce a new major product release). The key elements are as follows. The existing enterprise data warehouse and master data management product offerings remain, but have been packaged with some new elements into solutions which are effectively different pricing/functionality mechanisms on the same core code base.

The main positioning change is the introduction of pre-built business models on top of the core technology to provide “solutions” in the areas of profitability management, specifically “customer profitability” and “product profitability”. This move is, in many ways, long overdue, as Kalido was frequently deployed in such applications but previously made no attempt to provide a pre-configured data model. Given that Kalido is very strong at version management, it is about the one data warehouse technology that can plausibly offer this without falling into the “analytic app” trap whereby a pre-built data model, once tailored, quickly becomes out of sync with new releases (as Informatica can testify after their ignominious withdrawal from this market a few years ago). In Kalido’s case its version management allows for endless tinkering with the data model while still being able to recreate previous model versions.

Kalido also announced two new packaging offerings targeted at performance management/business intelligence, one for data mart consolidation and one for a repository for corporate performance management (the latter will be particularly aimed at Cognos customers, with whom Kalido recently announced a partnership). Interestingly, these two offerings are available on a subscription basis as an alternative to traditional licensing. This is a good idea, since the industry in general is moving towards such pricing models, as evidenced by salesforce.com in particular. In these days of carefully scrutinised procurement of large software purchases, having something the customers can try out and rent rather than buy should ease sales cycles.

The recent positioning change doesn’t, however, ignore the IT audience – with solution sets geared toward “Enterprise Data Management” and “Master Data Management.” The enterprise data management category contains solutions that those familiar with Kalido will recognize as typical use cases – departmental solutions, enterprise data warehouse and networked data warehouse. The key product advance here is in scalability. Kalido was always able to handle large volumes of transaction data (one single customer instance had over a billion transactions) but there was an Achilles heel if there was a single very large master data dimension of many millions of records. In B2B situations this doesn’t happen (how many products do you sell, or how many stores do you have – tens or hundreds of thousands only) but in B2C situations, e.g. retail banking and Telco, it could be a problem given that you could well have 50 million customers. Kalido was comfortable up to about 10 million master data items or so in a single dimension, but struggled much beyond that, leaving a federated (now “networked”) approach as the only way forward. However in the new release some major re-engineering underneath the covers allows very large master data dimensions in the 100 million range. This effectively removes the only real limitation on Kalido scalability; now you can just throw hardware at very large single instances, while Kalido’s unique ability to support a network of linked data warehouses continues to provide an effective way of deploying global data warehouses.

Technologically, Kalido’s master data management (MDM) product/solution is effectively unaffected by these announcements since it is a different code base, and a major release of this is due in January.

This new positioning targets Kalido more clearly as a business application, rather than a piece of infrastructure. This greater clarity is a result of its new CEO (Bill Hewitt), who has a strong marketing background, and should improve the market understanding of what Kalido is all about. Kalido always had differentiated technology and strong customer references (a 97% customer renewal rate testifies to that) but suffered from market positioning that switched too often and was fuzzy about the customer value proposition. This is an encouraging step in the right direction.

SOA (service oriented architecture) is nothing if not fashionable, but it seems to me that the reality could be a struggle. For a start, the very laudable idea, the seamless reuse of interoperable application objects, is not exactly new. Those of us who are a bit long in the tooth will recall the “applets” of the early 1990s, and then there was the whole CORBA thing. Everyone got excited about the idea of being able to mix and match application objects at will, but in practice everyone did the opposite and just bought monolithic suites from SAP, Oracle and other vendors who are mostly now owned by Oracle (JDE, Peoplesoft). What exactly is different this time around?

Surely in order to obtain true re-use you are going to need some standards (which have moved on a bit, but not enough), some way of mapping out business processes, and some way of dealing with the semantic integration of data (if two applications want to trade some data about “customer”, who controls what that particular term means?). In addition you need some solid infrastructure to do some of the heavy lifting, and there are certainly some choices here. In this last area we have IBM Websphere, the immature Fusion from Oracle, Netweaver from SAP and independent alternatives like BEA. On the process mapping side there are newcomers like Cordys and a host of others. The trouble here is that it seems to me that the more complete the offering, the more credible it is, yet the more difficult it will be to sell, since enterprises already have a stack of established middleware that they do not want to swap out. If you already have an enterprise service bus from (say) Cape Clear and EAI software from Tibco, plus some Netweaver as you have SAP applications, then how exactly do you combine these different technologies in a seamless way? The last thing you want to do is introduce something new unless you cannot avoid it.

To make matters worse, there has been little real progress on data integration, particularly when it comes to semantic integration. The pure plays which have real technology have either been swallowed up (like Unicorn) or are relatively small vendors (e.g. Kalido). The giant vendors of middleware have little to offer here, and are intent on grabbing mindshare from each other rather than striving to make their technologies genuinely interoperable with those from other vendors. Master data management is as yet very immature as a mechanism for helping out here, and again the decent technologies live with innovative but smaller vendors. Some partnerships between the giants and the best pure-plays will presumably be the way to go, but this has yet to really happen.

Finally, you have what may be the trickiest barrier of all, which is human. To get reuse you need to be aware of what is already out there (partly a technical issue) and also really want to take advantage of it (a more human problem). Programmers are by nature “not invented here” types, and this is one thing that made object orientation hard to scale in terms of reuse beyond programming teams. Whatever happened to those “exchanges” of objects? You can read about early SOA pilots, but I observe that these seem generally of limited scale and usually restricted to one set of middleware. This is a long way from the “plug and play” nirvana that is promised.

To be sure, SOA has a most appealing end-goal, and so will most likely run and run, yet to me the barriers in its path are considerable and some of the necessary components are yet to be fully matured.

Webmethods joined in the metadata/master data party through acquiring the assets of Cerebra, a company that brought “active metadata” technology to the market in 2001 but had struggled to make much market impact. As one of the pioneers of EAI technology, Webmethods makes a logical enough home for Cerebra, whose financial results are not known but whose website shows that it last managed a press release in March 2006.

Webmethods itself has managed to stabilise its position after some difficult years. At revenues of USD 201M it is a large company, but over the last five years it has averaged a net loss of over 12% of revenues. Even its last year, where it managed a small profit, represents a shrinking of revenue by nearly 4% over the prior year. The stockmarket has not been impressed, marking down the share price of Webmethods by 11% over the last 3 months.

Still, in principle Webmethods ought to be able to make good use of the Cerebra technology, since active discovery of corporate metadata is something that is quite relevant to EAI projects. Given Tibco’s entry into the area some time ago it is perhaps only surprising how long it has taken them. Whether this will be enough to revive Webmethods’ fortunes remains to be seen.

If you go into Google you can find most things remarkably quickly amongst the vastness of the internet, so why can’t you find your sales data inside a large company? This perfectly reasonable question has prompted some BI vendors to team up with Google in order to put a Google search front end onto enterprise data. Sound too good to be true? Sadly I fear that it is. The search capabilities of Google are superb at searching for keywords on websites, and enable you to quickly zero in on what you are looking for provided you can make your search keywords precise enough. Unfortunately the same technique does not translate well to the semantic nuances of enterprise data, where finding a database with “price” data in it does not give you sufficient context (price for which product, under which commission scheme, on what date, within which sales area, etc.?). Moreover a search engine does not yet generate the SQL to get the wretched data out of the corporate databases where the answers lie. Hence to put a Google front end on to a BI tool you are probably going to have to run a bunch of reports, give them some tags and publish them as web pages – Google will certainly be able to deal with that, but then is this so much better than just picking the report you want out from a list anyway?

Andrew Binstock writes a useful article about this in Infoworld but perhaps glosses over the magnitude of the problem in terms of finding answers in data on an ad hoc basis. Indeed early implementations essentially throw the problem back to a BI tool, which generates results in a form that a search front end like Google can use. Usually this is not the biggest problem anyway, as it is easy enough to put menus together with the top 20 or whatever regular reports for users to choose from. I can see a real use for this when the sheer number of canned reports gets out of hand though. If you have thousands of reports to trawl through then having a search front end could be genuinely useful. But the lack of semantic understanding of enterprise definitions will make it just as hard for a search tool to make any sense of a mass of numbers as it is for a BI tool, which relies on either some form of front-end semantic layer (as Business Objects uses) or assumes the existence of a data warehouse where the semantic complexity has been pre-resolved into a single consistent form. As the article correctly points out, the only way to fix this is through better metadata. Indeed, greatly improved master data definitions could find a further use as tags to help search engine front-ends make more sense of large numbers of pre-built corporate reports. Unfortunately the nirvana of ad hoc access to corporate data via an intuitive search front-end seems to me no closer than before.

What is certainly true is that the BI vendors can use these Google front-ends to make pretty demos to try and sell more software. However they do run a hidden danger in doing so. Given that at present the BI vendors compete partly through the ease of use of their graphical interfaces, by handing over the user interface to Google they may be in danger of commoditising part of their competitive advantage. If you have a simple search front-end, who knows whether the report originally came from Business Objects, Cognos, or Information Builders? I wonder whether the BI vendors have really thought through the danger to their own businesses that this seemingly innocent search front end could become. By jumping on the Google bandwagon they could be unleashing something that removes their direct contact with the end user, a key element in differentiation.

In an article in DM Review Malcolm Chisholm discusses different types of metadata. He sets out a definition which distinguishes between metadata, master data and reference data (separate from “transaction activity” data). I believe that the argument is flawed in several important ways.

Firstly, I believe that the distinction between metadata, master data, enterprise structure data and reference data as made in the article is actually spurious. One point made about master data is the notion that “Customer A is just Customer A” and there is no more to it than that. However, to the account manager looking after the customer there is a complex semantic which needs data to define it. What if that customer is, say, “Unilever”? There are all kinds of embedded meaning about the definition of Unilever that are not directly implied by the row itself, but are defined elsewhere, e.g. is that the whole Unilever group of companies, or Unilever in the US, a Unilever factory or what? This type of definitional problem occurs for row-level entries just as it does for the generic class of things called “customer”. Master data can have semantic meaning at the row level, just as can “reference data” as used in the article. This point is illustrated further if we use the article’s own example of this: the USA having multiple meanings. Both are valid perspectives for the USA but they are different things – they are defined and differentiated by the states that make them up, i.e. their composition. This is the semantic of the two objects.

The article seems to want to create ever more classifications of data, including “enterprise structure data”. It argues that “Enterprise structure data is often a problem because when it changes it becomes difficult to do historical reporting”. This is really just another type of master data. The problem of change can be dealt with by ensuring that all the data like this (and indeed all master data) has a “valid from” and “valid to” date. Hence if an organisation splits into two, then we want to be able to view data as it was at a point in time: for example before and after the reorganisation. Time stamping the data in this way addresses this problem; having yet another type of master data classification does not help.
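As a rough illustration of the “valid from”/“valid to” approach, here is a small Python sketch of an organisation that splits in two, with an as-of lookup that reproduces the structure on either side of the reorganisation. The unit codes, dates and the choice of an exclusive end date are all assumptions made up for the example; they are not drawn from the DM Review article or from any product.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class OrgUnit:
    code: str
    parent: str
    valid_from: date
    # None means "still current"; the end date is treated as exclusive.
    valid_to: Optional[date] = None


# Hypothetical history: SALES is split into two units on 1 July 2006.
HISTORY = [
    OrgUnit("SALES", "GROUP", date(2000, 1, 1), date(2006, 7, 1)),
    OrgUnit("SALES_EU", "GROUP", date(2006, 7, 1)),
    OrgUnit("SALES_US", "GROUP", date(2006, 7, 1)),
]


def as_of(history: List[OrgUnit], when: date) -> List[OrgUnit]:
    """Return the units that were valid on the given date."""
    return [
        unit for unit in history
        if unit.valid_from <= when
        and (unit.valid_to is None or when < unit.valid_to)
    ]


if __name__ == "__main__":
    for day in (date(2006, 6, 30), date(2006, 7, 1)):
        print(day, [unit.code for unit in as_of(HISTORY, day)])
```

Nothing is ever overwritten: the old unit is closed off with an end date and the new ones are added, so reports can be run against either version of the structure.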

The distinction between “reference data” and “master data” made in the article seems to be both false and misleading. Just because “volumes of reference data are much lower than what is involved in master data and because reference data changes more slowly” in no way means that it needs to be treated differently. In fact, it is a very difficult line to draw, since while typically master data may be more volatile, “reference data” also can change, with major effect, and so systems that store and classify it need to be able to expect and to deal with these changes.

In fact, one man’s transaction is another man’s reference data. A transaction like “payment” has reference data like Payment Delivery, Customer, Product, Payment Type. A “delivery” transaction, from the point of view of a driver, might consist of Order, Product, Location, Mode of Delivery. Similarly an “order” could be viewed by a clerk as Contract, Product, Customer, Priority. Where is the line between master and reference data to be drawn?

The article argues that identification is a major difference between master and reference data, and that it is better to have meaningful rather than meaningless surrogate keys for things, which the author acknowledges is contrary to received wisdom. In fact there are very good reasons not to embed the meaning of something in its coding structure. The article states that: “In reality, they are causing more problems because reference data is even more widely shared than master data, and when surrogate keys pass across system boundaries, their values must be changed to whatever identification scheme is used in the receiving system.”

But this is mistaken. Take the very real-world example of article numbering: the Standard Industry Codes (SIC) and European Article Number (EAN) codes, the latter of which are attached to products like pharmaceuticals to enable pharmacists to uniquely identify a product. Here a high-level part of the key is assigned, e.g. to represent the European v. the US v. the Australian organisation (e.g. GlaxoSmithKline in Europe), and then the rest of the key is defined as Glaxo wishes. If the article is referred to by another system, e.g. that of a supplier of Glaxo, then it can be identified as one of Glaxo’s products. This is an example of what is called a “global or universal unique identifier” (GUID or UUID), for which indeed there are emerging standards.

A complication is that when the packaging changes, even because of changed wording on the conditions of use, then a new EAN code has to be assigned. The codes themselves are structured, often considered bad practice in the IT world, but the idea is to ensure global uniqueness and not to give meaning to the code. Before Glaxo Wellcome and SmithKline Beecham merged they each had separate identifiers and so the ownership of the codes changed when the merger took place.

Another point I disagree with in the article is “we will be working with a much narrower scope” in the first paragraph. Surely we are trying to integrate information across the company to get a complete perspective. It is only small transactional applets which need just a worm’s-eye view of what they are doing.

The article says “Reference data is any kind of data that is used solely to categorize other data in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise”. But someone in the organization does have to manage this data even if it comes from outside the company, and that person’s transaction may be the set up of this data and making it available to others.

For example, consider the setting of a customer’s credit rating. Someone in Finance has to review a new customer’s credit rating against a list of externally defined credit ratings, say from D&B. Someone in the company spends time lobbying D&B (or parliament/congress) to have additional credit classifications. (The article defines them as Gold, Silver, Bronze etc., but D&B call them AAA, AA etc.) Data is always created through someone carrying out some business function (or transaction); even standards have to be managed somewhere.

A good example of this type of external data where a computer system is used to support the process is the Engineering parts library. It uses the ISO 15926 standard. It is a collaborative process between specialists from multiple engineering companies. It is a high level classification scheme which is used to create a library of spare parts for cars, aircraft, electronics etc. This is a changing world and there are always new and changing classifications. Groups of engineers who are skilled in some engineering domain define the types and groups of parts. One group defines pumps, another piping. Someone proposes a change and others review it to see if it will impact their business; it then goes through a review process and ultimately gets authorized as part of the standard.

This example is about reference data, in the terms of the article, but it clearly has the problem the article attributes to master data. There are multiple versions and name changes and a full history of change has to be maintained if you wish to relate things from last year with things for this year.

The article has an example concerning the marketing department’s view of customer v. accounts’ view of customer. It says this is a master data management issue and is semantic, but that this doesn’t apply to reference data. It clearly does relate to reference data (see the definition of USA and the ISO example above). But what is more important is that the issue can be resolved for both master and reference data by adopting the standards for integration defined in ISO 15926. Instead of trying to define customer in a way that satisfies everyone it is best to find what is common and what is different. Customers in both definitions are Companies – it is just that some of them have done business with us and others have not (yet). Signed up customers are a subset of all potential customers.

At the end of the section on The Problem of Meaning the article says “These diverse challenges require very different solutions”, then in the section on Links between Master and Reference data it says “If there is a complete separation of master and reference data management, this can be a nightmare” and then says “we must think carefully about enterprise information as a whole”. I agree with this final statement but it is critical that we do not put up artificial boundaries and try to solve specific problems with some generic rules which differentiate according to some rather arbitrary definition such as Master and Reference data.

The line between master and reference data is really fuzzy in the definition used. Clearly “Product” is master data, but if I have a retail gasoline customer which has only three products (Unleaded, Super and Diesel) I guess that means this is reference data. The engineering parts library classification scheme is a complex structure with high volumes (1000s) of classes so that makes it master data, but it is outside the company so does that make it reference data?

In summary, the article takes a very IT-centric transactional view of the world. By trying to create separate classifications where in fact none exist, the approach suggested, far from simplifying things, will in fact cause serious problems if implemented, as when these artificial dividing lines blur (which they will) then the systems relying on them will break. Instead what is needed is not separation, but unity. Master data is master data is master data, whether it refers to the structure of an enterprise, a class of thing or an instance of a thing. It needs to be time-stamped and treated in a consistent way with other types of master data, not treated arbitrarily differently. Consistency works best here.

I am indebted to Bruce Ottmann, one of the world’s leading data modelers, for some of the examples used in this blog.

David Stodder makes a good point in an article in Intelligent Enterprise. Business rules take many forms in a large corporation but today they are quite opaque. Rules that define even basic terms like “gross margin” may not only be buried away in complex spreadsheet models or ERP systems, but are in practice usually held in many different places, with no guarantee of consistency. I know of one company where an internal audit revealed twenty different definitions of “gross margin”, and that was within just one subsidiary of the company! In these days of stricter compliance such things are no longer merely annoying.

My observation is that business customers need to take ownership of, and be heavily engaged with, any process to try and improve this situation. It cannot be an IT-driven project. It is not critical whether the ultimate repository of this is a data warehouse, a master data repository or some different business rules repository entirely, but it is key that the exercise actually happens. At present the opaqueness and lack of consistency of business rules is not something that most companies care to own up to, yet it is a major controls issue as well as a source of a great deal of rework and difficulty in presenting accurate data in many contexts.

I was amused by the readership poll quoted, in which 61% of respondents said that they have “no standard process or practice” for business rules management. This might imply that 39% actually did, a number I would treat with considerable caution. Personally I have yet to encounter any company that does so on a global basis.

Andy Hayler

Andy Hayler is a passionate and outspoken commentator on the enterprise software market. A 20-year veteran of data modelling, warehousing and integration projects, he was named a Red Herring Top 10 Innovator in 2002 for founding Kalido – an innovative information management company that provides customers with the ability to dynamically view the impact of business changes. The views expressed on this blog are Andy’s own, and do not necessarily reflect the views of The Information Difference.