Archive for the ‘Business Intelligence’ Category

Data profiling is an excellent diagnostic method for gaining additional understanding of the data. Profiling the source data helps inform both business requirements definition and detailed solution designs for data-related projects, as well as enabling data issues to be managed ahead of project implementation.

Profiling of a data set will be measured with reference to an agreed set of Data Quality Dimensions (e.g. per those proposed in the recent DAMA white paper).

Profiling may be required at several levels:

• Simple profiling with a single table (e.g. Primary Key constraint violations)
• Medium complexity profiling across two or more interdependent tables (e.g. Foreign Key violations)
• Complex profiling across two or more data sets, with applied business logic (e.g. reconciliation checks)
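
As a minimal sketch of the first two levels, the checks might look something like the Python below. It assumes two hypothetical tables, `customers` and `orders`, held as lists of dictionaries; the table names, column names and sample rows are purely illustrative. The third level (reconciliation with applied business logic) is left out because it depends entirely on the rules of the business in question.

```python
from collections import Counter

# Hypothetical sample data: names and values are illustrative only.
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
    {"customer_id": 2, "name": "Globex Ltd"},  # duplicate key value
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 100.0},
    {"order_id": 11, "customer_id": 3, "amount": 50.0},  # no such customer
]


def primary_key_violations(rows, key):
    """Simple profiling: key values that are missing or occur more than once."""
    counts = Counter(row.get(key) for row in rows)
    return sorted(
        (value for value, n in counts.items() if value is None or n > 1),
        key=str,
    )


def foreign_key_violations(child_rows, fk, parent_rows, pk):
    """Medium profiling: child rows whose key has no matching parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


duplicate_ids = primary_key_violations(customers, "customer_id")
orphan_orders = foreign_key_violations(orders, "customer_id", customers, "customer_id")
print(duplicate_ids)
print([row["order_id"] for row in orphan_orders])
```

In practice these checks would more likely be run as queries against the source database or through a profiling tool; the point of the sketch is only the shape of each test.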

Note that field-by-field analysis is required to truly understand the data gaps.

Any data profiling analysis must not only identify the issues and underlying root causes, but must also identify the business impact of the data quality problem (measured by effectiveness, efficiency, risk inhibitors). This will help identify any value in remediating the data – great for your data quality Business Case. Root cause analysis also helps identify any process outliers and drives out requirements for remedial action on managing any identified exceptions.

Be sure to profile your data and take baseline measures before applying any remedial actions – this will enable you to measure the impact of any changes.
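
One way to picture a baseline measure is a per-field completeness score taken before and after remediation. The sketch below is an assumption-laden illustration: the `completeness` function, the field names and the sample records are all hypothetical, and completeness is only one of the dimensions you might choose to baseline.

```python
def completeness(rows, fields):
    """Share of rows, per field, where the value is neither None nor blank."""
    total = len(rows)
    result = {}
    for field in fields:
        populated = sum(
            1
            for row in rows
            if row.get(field) is not None and str(row.get(field)).strip() != ""
        )
        result[field] = populated / total if total else 0.0
    return result


# Hypothetical records before and after a remediation exercise.
before = [
    {"email": "a@example.com", "postcode": ""},
    {"email": None, "postcode": "AB1 2CD"},
    {"email": "c@example.com", "postcode": None},
    {"email": "d@example.com", "postcode": "EF3 4GH"},
]
after = [
    {"email": "a@example.com", "postcode": "ZZ9 9ZZ"},
    {"email": None, "postcode": "AB1 2CD"},
    {"email": "c@example.com", "postcode": None},
    {"email": "d@example.com", "postcode": "EF3 4GH"},
]

fields = ["email", "postcode"]
baseline = completeness(before, fields)
current = completeness(after, fields)
change = {field: current[field] - baseline[field] for field in fields}
print(baseline, current, change)
```

Without the `baseline` figures there would be nothing to compare `current` against, which is the whole argument for measuring first.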

I strongly recommend that Data Quality Profiling and root-cause analysis be undertaken as an initiation activity as part of all data warehouse, master data and application migration project phases.

Over the years, I’ve tended to find that asking any individual or group the question “What data/information do you want?” gets one of two responses:

“I don’t know.” Or;

“I don’t know what you mean by that.”

End of discussion, meeting over, pack up go home, nobody is any the wiser. Result? IT makes up the requirements based on what they think the business should want, the business gets all huffy because IT doesn’t understand what they need, and general disappointment and resentment ensues.

Clearly for Information Management & Business Intelligence solutions, this is not a good thing.

So I’ve stopped asking the question. Instead, when doing requirements gathering for an information project, I go through a workshop process that follows this outline agenda:

Context setting: Why information management / Business Intelligence / Analytics / Data Governance* is generally perceived to be a “good thing”. This is essentially a very quick précis of the BI project mandate, and should aim at putting people at ease by answering the question “What exactly are we all doing here?”

(*Delete as appropriate).

Business Function & Process discovery: What do people do in their jobs – functions & tasks? If you can get them to explain why they do those things – i.e. to what end purpose or outcome – so much the better (though this can be a stretch for many.)

Challenges: what problems or issues do they currently face in their endeavours? What prevents them from succeeding in their jobs? What would they do differently if they had the opportunity to do so?

Opportunities: What is currently good? What existing capabilities (systems, processes, resources) are in place that could be developed further or re-used/re-purposed to help achieve the desired outcomes?

Desired Actions: What should happen next?

As a consultant, I see it as part of my role to inject ideas into the workshop dialogue too, using a couple of question forms specifically designed to provoke a response:

“What would happen if…X”

“Have you thought about…Y”

“Why do you do/want…Z”.

Notice that as the workshop discussion proceeds, the participants will naturally start to explore aspects that relate to later parts of the agenda – this is entirely ok. The agenda is there to provide a framework for the discussion, not a constraint. We want people to open up and spill their guts, not clam up. (Although beware of the “rambler” who just won’t shut up but never gets to the point…)

Notice also that not once have we actively explored the “D” or “I” words. That’s because as you explore the agenda, any information requirements will either naturally fall out of the discussion as it proceeds, or else you can infer the information requirements arising based on the other aspects of the discussion.

As the workshop attendees explore the different aspects of the session, you will find that the discussion will touch upon a number of different themes, which you can categorise and capture on-the-fly (I tend to do this on sheets of butcher’s paper tacked to the walls, so that the findings are shared and visible to all participants). Comments will typically fall into the following broad categories:

* Functions: Things that people do as part of doing business.
* Stakeholders: People who are involved (including helpful people elsewhere in the organisation – follow up with them!)
* Inhibitors: Things that currently prevent progress (these either become immediate scope-change items if they are show-stoppers for the current initiative, or else they form additional future project opportunities to raise with management)
* Enablers: Resources to make use of (e.g. data sets that another team hold, which aren’t currently shared)
* Constraints: “Non-negotiable” aspects that must be taken into account. (Note: I tend to find that all constraints are actually negotiable and can be overcome if there is enough desire, money and political will.)
* Considerations: Things to be aware of that may have an influence somewhere along the line.
* Source systems: Places where data comes from
* Information requirements: Outputs that people want

Workshop Victim Participant #2: “We think there’s a discrepancy in the warehouse stock balances, compared with what’s been shipped to customers. The sales guys keep their own database of customer contracts and orders and Jim’s already given us a dump of the data, while finance run the accounts receivables process. But Sally the Accounts Clerk doesn’t let the numbers out under any circumstances, so basically we’re screwed.”

You will also probably end up with the attendees identifying a number of immediate self-assigned actions arising from the discussion – good ideas that either haven’t occurred to them before or have sat on the “To-Do” list. That’s your workshop “value add” right there….

e.g.
Workshop Victim Participant #1: “I could go and speak to the Financial Controller about getting access to the finance data. He’s more amenable to working together than Sally, who just does what she’s told.”

Businesses tapping into the potential of cloud computing now make up the vast majority of enterprises out there. If anything, it’s those companies disregarding the cloud that have fallen way behind the rest of the pack. According to the most recent State of the Cloud Survey from RightScale, 87% of organizations are using the public cloud. Needless to say, businesses have figured out just how advantageous it is to make use of cloud computing, but now that it’s mainstream, experts are trying to predict what’s next for the incredibly useful technology. Here’s a look at some of the latest trends for cloud computing for 2014 and what’s to come in the near future.

1. IT Costs Reduced

There are a multitude of reasons companies have gotten involved with cloud computing. One of the main reasons is to reduce operational costs. This has proven true so far, but as more companies move to the cloud, those savings will only increase. In particular, organizations can expect to see a major reduction in IT costs. Adrian McDonald, president of EMEA at EMC, says the unit cost of IT could decrease by more than 38%. This development could allow for newer, more creative services to come from the IT department.

2. More Innovations

Speaking of innovations, the proliferation of cloud computing is helping business leaders use it for more creative solutions. At first, many felt cloud computing would allow companies to run their business in mostly the same way only with a different delivery model. But with cloud computing becoming more common, companies are finding ways to obtain new insights into new processes, effectively changing the way they were doing business before.

3. Engaging With Customers

To grow a business, one must attract new customers and hold onto those that are already loyal. Customer engagement is extremely important, and cloud computing is helping companies find new ways to do just that. By powering systems of engagement, cloud computing can optimize how businesses interact with customers. This is done with database technologies along with collecting and analyzing big data, which is used to create new methods of reaching out to customers. With cloud computing’s easy scalability, this level of engagement is within the grasp of every enterprise no matter the size.

4. More Media

Another trend to watch out for is the increased use of media among businesses, even if they aren’t media companies. Werner Vogels, the vice president and CTO of Amazon, says that cloud computing is giving businesses media capabilities that they simply didn’t have before. Companies can now offer daily, fresh media content to customers, which can serve as another avenue for revenue and retention.

5. Expansion of BYOD

Bring Your Own Device (BYOD) policies are already pretty popular with companies around the world. With cloud computing reaching a new high point, expect BYOD to expand even faster. With so many wireless devices in use, the cloud becomes necessary for storing and accessing valuable company data. IT personnel are also finding ways to use cloud services through mobile device management, mainly to organize and keep track of each worker’s activities.

6. More Hybrid Cloud

Whereas before there was a lot of debate over whether public or private cloud should be used by a company, it has now become clear that businesses are choosing to use hybrid clouds. The same RightScale cloud survey mentioned before shows that 74% of organizations have already developed a hybrid cloud strategy, with more than half of them already using it. Hybrid clouds combine private cloud security with the power and scalability of public clouds, basically giving companies the advantages of both. It also allows IT to come up with customized solutions while maintaining a secure infrastructure.

These are just a few of the trends that are happening as cloud computing expands. Its growth has been staggering, fueling greater innovation in companies as they look to save on operational costs. As more and more businesses get used to what the cloud has to offer and how to take full advantage of its benefits, we can expect even greater developments in the near future. For now, the technology will continue to be a valuable asset to every organization that makes the most of it.

Data Management lore would have us believe that estimating the amount of work involved in Data Quality analysis is a bit of a “Dark Art,” and to get a close enough approximation for quoting purposes requires much scrying, haruspicy and wet-finger-waving, as well as plenty of general wailing and gnashing of teeth. (Those of you with a background in Project Management could probably argue that any type of work estimation is just as problematic, and that in any event work will expand to more than fill the time available…).

At first glance, the overall methodology that David proposes is reasonable in terms of estimating effort for a pure profiling exercise – at least in principle. (It’s analogous to similar “bottom/up” calculations that I’ve used in the past to estimate ETL development on a job-by-job basis, or creation of standard Business Intelligence reports on a report-by-report basis).

I would observe that David’s approach is predicated on the (big and probably optimistic) assumption that we’re only doing the profiling step. The follow-on stages of analysis, remediation and prevention are excluded – and in my experience, that’s where the real work most often lies! There is also the assumption that a pre-existing checklist of assessment criteria exists – and developing the library of quality check criteria can be a significant exercise in its own right.

However, even accepting the “profiling only” principle, I’d also offer a couple of additional enhancements to the overall approach.

Firstly, even with profiling tools, the inspection and analysis process for any “wrong” elements can go a lot further than just a 10-minute-per-item-compare-with-the-checklist, particularly in data sets with a large number of records. Also, there’s the question of root-cause diagnosis (And good DQ methods WILL go into inspecting the actual member records themselves). So for contra-indicated attributes, I’d suggest a slightly extended estimation model:

* 10 mins: for each “Simple” item (standard format, no applied business rules, fewer than 100 member records)
* 30 mins: for each “Medium” complexity item (unusual formats, some embedded business logic, data sets up to 1000 member records)
* 60 mins: for any “Hard” high-complexity items (significant, complex business logic, data sets over 1000 member records)
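
To show how the tiers add up, here is a minimal sketch in Python. The per-item minutes are the three figures from the list above; the function name `inspection_minutes` and the mix of flagged attributes are hypothetical, chosen only to give the arithmetic something to work on.

```python
# Minutes per flagged item, taken from the three tiers listed above.
MINUTES_PER_ITEM = {"simple": 10, "medium": 30, "hard": 60}


def inspection_minutes(item_counts):
    """Total inspection time in minutes for flagged attributes, by tier."""
    unknown = set(item_counts) - set(MINUTES_PER_ITEM)
    if unknown:
        raise ValueError(f"Unknown complexity tiers: {sorted(unknown)}")
    return sum(MINUTES_PER_ITEM[tier] * count for tier, count in item_counts.items())


# Hypothetical mix of flagged attributes, for illustration only.
flagged = {"simple": 12, "medium": 5, "hard": 3}
total_minutes = inspection_minutes(flagged)
print(f"{total_minutes} minutes ({total_minutes / 60:.1f} hours)")
```

For this made-up mix the sum is 12 × 10 + 5 × 30 + 3 × 60 = 450 minutes, or seven and a half hours of inspection on top of the basic profiling run.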

Secondly, and more importantly – David doesn’t really allow for the human factor. It’s always people that are bloody hard work! While it’s all very well to do a profiling exercise in-and-of-itself, the results need to be shared with human beings – presented, scrutinised, questioned, validated, evaluated, verified, justified. (Then acted upon, hopefully!) And even allowing for the set-aside of the “Analysis” stages onwards, there will still need to be some form of socialisation within the “Profiling” phase.

That’s not a technical exercise – it’s about communication, collaboration and co-operation. Which means it may take an awful lot longer than just doing the tool-based profiling process!

How much socialisation? That depends on the number of stakeholders, and their nature. As a rule-of-thumb, I’d suggest the following:

* Two hours of preparation per workshop (if the stakeholder group is “tame”; double it if there are participants who are negatively inclined).
* One hour face-time per workshop (Double it for “negatives”)
* One hour post-workshop write-up time per workshop
* One workshop per 10 stakeholders.
* Two days to prepare any final papers and recommendations, and present to the Steering Group/Project Board.

That’s in addition to David’s formula for estimating the pure data profiling tasks.
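
The rule-of-thumb for socialisation can be sketched as a small Python calculation. Several things here are my assumptions rather than part of the rule itself: an eight-hour working day (`HOURS_PER_DAY`), rounding up to a whole number of workshops, and applying the “negatives” doubling to every workshop rather than only some of them. The function name is hypothetical too.

```python
import math

HOURS_PER_DAY = 8  # Assumption: the rule says "two days" without defining a day.


def socialisation_hours(stakeholders, hostile=False):
    """Estimate workshop and reporting effort in hours from the rules of thumb."""
    workshops = math.ceil(stakeholders / 10)  # one workshop per 10 stakeholders
    factor = 2 if hostile else 1  # double prep and face-time for "negatives"
    prep = 2 * factor
    face_time = 1 * factor
    write_up = 1  # write-up time is not doubled
    final_papers = 2 * HOURS_PER_DAY  # final papers and Steering Group presentation
    return workshops * (prep + face_time + write_up) + final_papers


tame_estimate = socialisation_hours(25)
hostile_estimate = socialisation_hours(25, hostile=True)
print(tame_estimate, hostile_estimate)
```

With 25 stakeholders that is three workshops: 3 × 4 + 16 = 28 hours for a tame group, or 3 × 7 + 16 = 37 hours if the group is negatively inclined.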

A “foreign” colleague of mine once told me a trick his English language teacher taught him to help him remember the “questioning words” in English. (To the British, anyone who is a non-native speaker of English is “foreign.” I should also add that as a Scotsman, English is effectively my second language…).

“Five Whiskies in a Hotel” is the clue – i.e. five questioning words begin with “W” (Who, What, When, Why, Where), with one beginning with “H” (How).

These simple question words give us a great entry point when we are trying to capture the initial set of issues and concerns around data governance – what questions are important/need to be asked.

* What data/information do you want? (What inputs? What outputs? What tests/measures/criteria will be applied to confirm whether the data is fit for purpose or not?)
* Why do you want it? (What outcomes do you hope to achieve? Does the data being requested actually support those questions & outcomes? Consider Efficiency/Effectiveness/Risk Mitigation drivers for benefit.)
* When is the information required? (When is it first required? How frequently? Particular events?)
* Who is involved? (Who is the information for? Who has rights to see the data? Who is it being provided by? Who is ultimately accountable for the data – both contents and definitions? Consider multiple stakeholder groups in both recipients and providers)
* Where is the data to reside? (Where is it originating from? Where is it going to?)
* How will it be shared? (How will the mechanisms/methods work to collect/collate/integrate/store/disseminate/access/archive the data? How should it be structured & formatted? Consider Systems, Processes and Human methods.)

Is this the kind of response you get when you mention to people that you work in Data Quality?!

Let’s be honest here. Data Quality is good and worthy, but it can be a pretty dull affair at times. Information Management is something that “just happens”, and folks would rather not know the ins-and-outs of how the monthly Management Pack gets created.

Yet I’ll bet that they’ll be right on your case when the numbers are “wrong”.

Right?

So here’s an idea. The next time you want to engage someone in a discussion about data quality, don’t start by discussing data quality. Don’t mention the processes of profiling, validating or cleansing data. Don’t talk about integration, storage or reporting. And don’t even think about metadata, lineage or auditability. Yaaaaaaaaawn!!!!

Instead of concentrating on telling people about the practitioner processes (which of course are vital, and fascinating no doubt if you happen to be a practitioner), think about engaging in a manner that is relevant to the business community, using language and examples that are business-oriented. Make it fun!

Once you’ve got the discussion flowing in terms of the impacts, challenges and inhibitors that get in the way of successful business operations, then you can start to drill into the underlying data issues and their root causes. More often than not, a data quality issue is symptomatic of a business process failure rather than being an end in itself. By fixing the process problem, the business user gains a benefit, and the data is enhanced as a by-product. Everyone wins (and you didn’t even have to mention the dreaded DQ phrase!)

Data Quality is a human thing – that’s why it’s hard. As practitioners, we need to be communicators. Lead the thinking, identify the impact and deliver the value.

Up until now, I’ve always struggled to think of a way to represent all of the different aspects of Information Management/Data Governance; the environment is multi-faceted, with the interconnections between the component capabilities being complex and not hierarchical. I’ve sometimes alluded to there being a network of relationships between elements, but this has been a fairly abstract concept that I’ve never been able to adequately illustrate.

When I was a kid growing up in the UK, Paul Daniels was THE television magician. With a combination of slick high drama illusions, close-up trickery and cheeky end-of-the-pier humour, (plus a touch of glamour courtesy of The Lovely Debbie McGee™), Paul had millions of viewers captivated on a weekly basis and his cheeky catch-phrases are still recognised to this day.

Of course, part of the fascination of watching a magician perform is to wonder how the trick works. “How the bloody hell did he do that?” my dad would splutter as Paul Daniels performed yet another goofy gag or hair-raising stunt (no mean feat, when you’re as bald as a coot…) But most people don’t REALLY want to know the inner secrets, and even fewer of us are inspired to spray a riffle-shuffled pack of cards all over granny’s lunch, stick a coin up our nose or grab the family goldfish from its bowl and hide it in the folds of our nether-garments. (Um, yeah. Let’s not go there…)

Penn and Teller are great of course, because they expose the basic techniques of really old, hackneyed tricks and force more innovation within the magician community. They’re at their most engaging when they actually do something that you don’t get to see the workings of. Illusion maintained, audience entertained.

As data practitioners, I think we can learn a few of these tricks. I often see us getting too hot-and-bothered about differentiating data, master data, reference data, metadata, classification scheme, taxonomy, dimensional vs relational vs data vault modelling etc. These concepts are certainly relevant to our practitioner world, but I don’t necessarily believe they need to be exposed at the business-user level.

For example, I often hear business users talking about “creating the metadata” for an event or transaction, when they’re talking about compiling the picklist of valid descriptive values and mapping these to the contextualising descriptive information for that event (which by my reckoning, really means compiling the reference data!). But I’ve found that business people really aren’t all that bothered about the underlying structure or rigour of the modelling process.

Grover is a semantic annotation markup syntax based on the grammar of the English language. Grover is related to the Object Management Group’s Semantics of Business Vocabulary and Rules (SBVR), explained later. Grover syntax assigns roles to common parts of speech in the English language so that simple and structured English phrases are used to name and relate information on the semantic web. By having as clear a syntax as possible, the semantic web is more valuable and useful.

An important open-source tool for semantic databases is SemanticMediaWiki, which permits everyone to create a personal “wikipedia” in which private topics are maintained for personal use. The Grover syntax is based on this semantic tool and the friendly wiki environment it delivers, though the approach below might also be amenable to other toolsets and environments.

Basic Approach. Within a Grover wiki, syntax roles are established for classes of English parts of speech.

Subject:noun(s) -- verb:article/verb:preposition -- Object:noun(s)

refines the standard Semantic Web pattern:

SubjectURL -- PredicateURL -- ObjectURL

while in a SemanticMediaWiki environment, with its relative URLs, this is the pattern:

In a Grover wiki, topic types are nouns (more precisely, nounal expressions) and represent concepts. Every concept is defined by a specific semantic database query, these queries being the foundation of a controlled enterprise vocabulary. In Grover every pagename is the name of a topic and every pagename includes a topic-type prefix. Example: Person:Barack Obama and Title:USA President of the United States of America, two topics related together through one or more predicate relations, for instance “has:this”. Wikis are organized into ‘namespaces’: each page name is prefixed with a namespace name, which functions equally as a topic-type name. Additionally, an ‘interwiki prefix’ can indicate the URL of the wiki where a page is located, in a manner compatible with the Turtle RDF language.

Nouns (nounal expressions) are the names of topic-types and/or of topics; in ontology-speak, nouns are class resources or individual resources, but rarely are nouns defined as property resources (and thereby used as a ‘predicate’ in the standard Semantic Web pattern, mentioned above). This noun requirement is a systemic departure from today’s free-for-all that allows nouns to be part of the name of predicates, leading to the construction of problematic ontologies from the perspective of common users.

verbs

In a Grover wiki, “property names” are an additional ontology component forming the bedrock of a controlled semantic vocabulary. Being pages in the “Property” namespace means these are prefixed with the namespace name, “Property”. However the XML namespace is directly implied, for instance has:this implies a “has” XML Namespace. The full pagename of this property is “Property:has:this”. The tenses of a verb (infinitive, past, present and future) are each an XML namespace, meaning there are separate have, has, had and will-have XML Namespaces. The modalities of a verb, may and must, are also separate XML Namespaces. Lastly the negation forms of verbs (involving not) are additional XML Namespaces.

The “verb” XML Namespace name is only one part of a property name. The other part of a property name is either a preposition or a grammatical article. Together, these comprise an enterprise’s controlled semantic vocabulary.

prepositions

As in English grammar, prepositions are used to relate an indirect object or object of a preposition, to a subject in a sentence. Example: “John is at the Safeway” uses a property named “is:at” to yield the triple Person:John -- is:at -- Store:Safeway. There are approximately one hundred English prepositions possible for any particular verbal XML Namespace. Examples: had:from, has:until and is:in.
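
To make the shape of such a triple concrete, here is a minimal Python sketch that splits the textual notation used in this post into its parts. It is not part of Grover or SemanticMediaWiki: the `Triple` type and `parse_triple` function are hypothetical names, and the parser assumes the “ -- ” separator and “prefix:name” form exactly as written in the example above.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject_type: str
    subject_name: str
    verb: str
    qualifier: str  # the preposition or article half of the property name
    object_type: str
    object_name: str


def parse_triple(text):
    """Split 'Type:Name -- verb:qualifier -- Type:Name' into its six parts."""
    parts = [part.strip() for part in text.split(" -- ")]
    if len(parts) != 3:
        raise ValueError(f"Expected three parts separated by ' -- ': {text!r}")
    subject, prop, obj = parts
    for part in (subject, prop, obj):
        if ":" not in part:
            raise ValueError(f"Expected a 'prefix:name' form: {part!r}")
    subject_type, subject_name = subject.split(":", 1)
    verb, qualifier = prop.split(":", 1)
    object_type, object_name = obj.split(":", 1)
    return Triple(subject_type, subject_name, verb, qualifier, object_type, object_name)


triple = parse_triple("Person:John -- is:at -- Store:Safeway")
print(triple)
```

The verb half (“is”) corresponds to the XML Namespace described earlier and the qualifier half (“at”) to the preposition, while the prefixes “Person” and “Store” are the topic types.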

articles

As in English grammar, articles such as “a” and “the” are used to relate direct objects or predicate nominatives to a subject in a sentence. As for prepositions above, articles are associated with a verb XML Namespace. Examples: has:a, has:this, has:these, had:some, has:some and will-have:some.

adjectives

In a Grover wiki, definitions in the “category” namespace include adjectives, such as “Public” and “Secure”. These categories are also found in a controlled modifier vocabulary. The category namespace also includes definitions for past participles, such as “Secured” and “Privatized”. Every adjective and past participle is a category in which any topic can be placed. A third subclass of modifiers includes ‘adverbs’, categories in which predicate instances are placed.

That’s about all that’s needed to understand Grover, the Business Syntax for Semantic English! Let’s use the Grover syntax to implement a snippet from the Object Management Group’s Semantics of Business Vocabulary and Rules (SBVR) which has statements such as this for “Adopted definition”:

adopted definition

Definition: definition that a speech community adopts from an external source by providing a reference to the definition.

Necessities: (1) The concept ‘adopted definition’ is included in Definition Origin. (2) Each adopted definition must be for a concept in the body of shared meanings of the semantic community of the speech community.

Now we can use Grover’s syntax to ‘adopt’ the OMG’s definition for “Adopted definition”.

This simplified but structured English permits the widest possible segment of the populace to participate in constructing and perfecting an enterprise knowledge base built upon the Resource Description Framework.

More complex information can be specified on wikipages using standard wiki templates. For instance to show multiple references on the “Term:Adopted definition” page, the “has:this” wiki template can be used:

One important feature of the Grover approach is its modification of our general understanding about how ontologies are built. Today, ontologies specify classes, properties and individuals; a data model emerges from listings of range/domain axioms associated with a property’s definition. Instead under Grover, an ontology’s data models are explicitly stated with deontic verbs that pair subjects with objects; this is an intuitively stronger and more governable approach for such a critical enterprise resource as the ontology.

You’re expected to work your data quality magic, solve other people’s data problems, and help people get better business outcomes. It’s a valuable, worthy and satisfying profession. But people can be infuriating and frustrating, especially when the business user isn’t taking responsibility for the quality of their own data.

It’s a bit like being a Medical Doctor in general practice.

The patient presents with some early indicative symptoms. The MD then performs a full diagnosis and recommends a course of treatment. It’s then up to the patient whether or not they take their MD’s advice…

AlanDDuncan: “Doctor, Doctor. I get very short of breath when I go upstairs.”
MD: “Yes, well. Your Body Mass Index is over 30, you’ve got consistently high blood pressure, your heartbeat is arrhythmic, and your cholesterol levels are off the scale.”
ADD: “So what does that mean, doctor?”
MD: “It means you’re fat, you drink like a fish, you smoke like a chimney, your diet consists of fried food and cakes and you don’t do any exercise.”
ADD: “I’m Scottish.”
MD: “You need to change your lifestyle completely, or you’re going to die.”
ADD: “Oh. So, can you give me some pills?….”

If you’re going to get healthy with your data, you’re going to have to put the pies down, step away from the Martinis and get off the couch, folks.