<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-size:16px;"><span style="font-weight: bold;">Journeys in the Information Landscape</span></span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Discussion of common themes and challenges in discovering, understanding/assessing, integrating, utilizing, managing, and governing information.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-size:16px;"><span style="font-weight: bold;">Self-Reference for your Information Governance Policies</span></span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Posted 2015-01-07</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Last year concluded with the release of my developerWorks article and accompanying content for IBM&#39;s Information Governance Catalog, which represented a good portion of my journey in the information landscape over the year.&nbsp; (See:&nbsp; <a href="http://www.ibm.com/developerworks/data/library/techarticle/dm-1412infosphere-governance/index.html?ca=drs-">Establish an information governance policy framework in InfoSphere Information Governance Catalog</a>.)&nbsp; While it was satisfying to see that delivered, my blogging was sadly short-changed.&nbsp; So it seems fitting to kick off the New Year with something of a relaunch of the blog, and to do so by expanding on a number of aspects of the information governance content that didn&#39;t fit the scope of the article.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
One aspect of information governance that I find intriguing is its self-reflective and self-referential character.&nbsp; Consider these questions:&nbsp; Have you defined the scope of your information governance effort?&nbsp; What policies do you have to support it?&nbsp; What requirements are needed for it?&nbsp; How is your information governance program monitored?&nbsp; How effective is it?&nbsp; Who is managing the effort?</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
These are much the same questions you&#39;d ask about specific subjects within information governance, whether data privacy, customer data quality, or specific government regulations.&nbsp; This suggests that you can follow the same process of mapping out policies and requirements for your information governance program as you do in any of these specific domains.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-size:16px;"><span style="font-weight: bold;">Establishing Policies for an Information Governance Program</span></span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Consider the set of Information Governance capabilities focused on Compliance.&nbsp; These reflect a broad goal of communicating policy, detecting exceptions and remediating issues, enforcing the policies, and auditing the information governance processes. For each of these capabilities, there are certain requirements that you want to establish in your organization.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<a href="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/InfoGovCompliancecapabilities.gif" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/InfoGovCompliancecapabilities.gif" style=" display:block; margin: 0 auto;text-align: center;" /></a></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Let&#39;s consider the first:&nbsp; <span style="font-weight:bold;">Policy Definition</span>. Broadly, this is part of what you might categorize as Policy Administration.&nbsp; To be effective, you want to describe your policy <span style="font-style:italic;text-decoration:underline;">for</span> Policy Definition.&nbsp; You could describe the intent as something like this: <span style="font-style:italic;">&quot;All information governance policies are defined so that they are clearly communicated to everyone in the organization.&quot;</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Likely you have more detailed requirements.&nbsp; For your policy for Policy Definition these could be:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-style:italic;"><span style="font-size:11.0pt;"><span style="font-family:calibri;">Information Governance Policies must be easy to find</span></span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-style:italic;"><span style="font-size:11.0pt;"><span style="font-family:calibri;">Information Governance Policies must have a name and definition</span></span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-style:italic;"><span style="font-size:11.0pt;"><span style="font-family:calibri;">Information Governance Policies must have an issuing organization (internal or external)</span></span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-style:italic;"><span style="font-size:11.0pt;"><span style="font-family:calibri;">Information Governance Policies should detail underlying requirements</span></span></span></li>
</ul>
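<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As a rough illustration, the three &quot;must&quot; requirements and the one &quot;should&quot; requirement above could be checked against a policy record along these lines.&nbsp; This is only a sketch: the <code>Policy</code> class, its field names, and the <code>check_policy_definition</code> helper are hypothetical and are not part of the Information Governance Catalog.&nbsp; The &quot;easy to find&quot; requirement is left out because it is not something a simple field check can verify.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical policy record; field names are illustrative only."""
    name: str = ""
    definition: str = ""
    issuing_organization: str = ""
    requirements: list = field(default_factory=list)


def check_policy_definition(policy):
    """Return (errors, warnings) for the 'must' and 'should' requirements."""
    errors, warnings = [], []
    # "must" requirements: a gap here is treated as an error
    if not policy.name.strip():
        errors.append("missing name")
    if not policy.definition.strip():
        errors.append("missing definition")
    if not policy.issuing_organization.strip():
        errors.append("missing issuing organization")
    # "should" requirement: a gap here is only a warning
    if not policy.requirements:
        warnings.append("no underlying requirements detailed")
    return errors, warnings


# A draft policy that so far has only a name
draft = Policy(name="Policy Definition")
errors, warnings = check_policy_definition(draft)
print(errors)
print(warnings)
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Keeping errors and warnings separate mirrors the difference between &quot;must&quot; and &quot;should&quot; in the wording of the requirements.</p>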
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-size:16px;"><span style="font-weight: bold;">Adding into the Information Governance Catalog</span></span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
With the above information, you can add this into the Information Governance Catalog.&nbsp; <a href="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/IGCPolicyHierarchy.png" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/IGCPolicyHierarchy.png" style=" display:block; margin: 1em 0pt 0pt 1em; float: right;" /></a>Using the content provided through the article noted above, you could add such a policy within the Information Governance Approaches policy hierarchy.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Such a decision is arbitrary and really depends on how you structure and view your organization.&nbsp; You could equally describe this as a Corporate Requirement under your Information Governance Obligations.&nbsp; Or perhaps you consider compliance with information governance to be one of your Information Governance Principles.&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
None of these are wrong.&nbsp; But I find it useful, and I think less confusing to many of those using the policies, to put these policies about information governance policies in their own hierarchy -- here under the policy category Information Governance Approaches.&nbsp; (Note that naming is also arbitrary.&nbsp; Feel free to take this content and modify to reflect your own choice of words.)</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
To add an underlying policy on Policy Definition, create a new policy and provide the common information you want to include:&nbsp; name, short description, long description if relevant, the parent policy so the policy is in the right hierarchy, a steward, and any custom attributes that are useful (I&#39;ve included an Issuing Organization and a Link to more Policy information).</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<a href="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/IGCPolicyDefinition.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/IGCPolicyDefinition.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
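<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
To make the role of the parent policy concrete, here is a minimal sketch of how a parent link places a policy in a hierarchy.&nbsp; The <code>PolicyNode</code> class and the <code>hierarchy_path</code> function are hypothetical names used for illustration; they do not reflect how the Information Governance Catalog stores policies.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PolicyNode:
    """Hypothetical policy entry with an optional parent policy."""
    name: str
    parent: Optional["PolicyNode"] = None
    steward: str = ""


def hierarchy_path(policy):
    """Walk the parent links upward and return the path from the top category."""
    names = []
    node = policy
    while node is not None:
        names.append(node.name)
        node = node.parent
    return " > ".join(reversed(names))


approaches = PolicyNode("Information Governance Approaches")
policy_definition = PolicyNode("Policy Definition", parent=approaches)
print(hierarchy_path(policy_definition))
# prints: Information Governance Approaches > Policy Definition
```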
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
You can take similar steps to add associated governance rules, which I think of as the primary requirements of the policy, since they allow you to elaborate on the contents and to connect the policy to relevant Glossary Terms or to metadata assets that either implement the governance rule or are governed by the rule.&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Depending on the custom attributes available, you might add in additional information such as scope (does this apply to all policies or only certain categories of policies), more specific implementation requirements, exception handling, etc.&nbsp; After adding the four requirements listed above, your policy for Policy Definition might look like the following.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<a href="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/IGCPolicyWithRules.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/IGCPolicyWithRules.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr" style="margin:0in;font-family:Calibri;font-size:11.0pt">
&nbsp;</p>
<p dir="ltr" style="margin:0in;font-family:Calibri;font-size:11.0pt">
<span style="font-size:16px;"><span style="font-weight: bold;">Benefits of defining and building policies for the Information Governance Program</span></span></p>
<p dir="ltr" style="margin:0in;font-family:Calibri;font-size:11.0pt">
&nbsp;</p>
<p dir="ltr" style="margin:0in;font-family:Calibri;font-size:11.0pt">
There are three useful benefits with this approach:</p>
<p dir="ltr" style="margin:0in;font-family:Calibri;font-size:11.0pt">
1) The self-referential aspect here means that you are <span style="font-weight:bold;
font-style:italic">defining the approaches others will use in defining and establishing their governance policies and rules</span> in the Information Governance Catalog.</p>
<p dir="ltr" style="margin:0in;font-family:Calibri;font-size:11.0pt">
2) You are <span style="font-weight:bold;font-style:italic">establishing the focal areas of your information governance program</span>.</p>
<p dir="ltr" style="margin:0in;font-family:Calibri;font-size:11.0pt">
3) You are <span style="font-weight:bold;font-style:italic">providing yourself with an opportunity to define, build, and learn how you can use and structure the Information Governance Catalog</span> to fit your information governance program needs.</p>
<p dir="ltr" style="margin:0in;font-family:Calibri;font-size:11.0pt">
&nbsp;</p>
<p dir="ltr" style="margin:0in;font-family:Calibri;font-size:11.0pt">
Some thoughts and questions for you:</p>
<ul dir="ltr" style="margin-left:.375in;direction:ltr;unicode-bidi:embed;
margin-top:0in;margin-bottom:0in" type="disc">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle">
<p>
<span style="font-family:Calibri;font-size:11.0pt">Do you use your own glossary to describe the terms, policies, and rules of your information governance program?</span></p>
<ul style="margin-left:.375in;direction:ltr;unicode-bidi:embed;
margin-top:0in;margin-bottom:0in" type="circle">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle">
<p>
<span style="font-family:Calibri;font-size:11.0pt">If so, what have you found that works well?</span></p>
</li>
</ul>
<ul style="margin-left:.375in;direction:ltr;unicode-bidi:embed;
margin-top:0in;margin-bottom:0in" type="circle">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle">
<p>
<span style="font-family:Calibri;font-size:11.0pt">If not, why not?</span></p>
</li>
</ul>
</li>
</ul>
<ul dir="ltr" style="margin-left:.375in;direction:ltr;unicode-bidi:embed;
margin-top:0in;margin-bottom:0in" type="disc">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle">
<p>
<span style="font-family:Calibri;font-size:11.0pt">In defining and establishing policies for information governance, do you identify, describe, or enforce any standards?</span></p>
</li>
</ul>
<ul dir="ltr" style="margin-left:.375in;direction:ltr;unicode-bidi:embed;
margin-top:0in;margin-bottom:0in" type="disc">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle">
<p>
<span style="font-family:Calibri;font-size:11.0pt">Who can see the information about your approaches and practices for information governance?</span></p>
</li>
</ul>
<p dir="ltr" style="margin:0in;font-family:Calibri;font-size:11.0pt">
&nbsp;</p>
<p dir="ltr" style="margin:0in;font-family:Calibri;font-size:11.0pt">
I&#39;m curious to see how many of you are considering your information governance program within your information governance initiatives.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-size:16px;"><span style="font-weight: bold;">Location, Location, Location - part 2, the Virtual Variety</span></span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Posted 2013-10-25</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Within an hour of posting my last blog on the Variety of Location data, I remembered three other sets of Location data that I did not include but should note:&nbsp; <span style="font-weight:bold;">Electronic Addresses, Voice/Phone Addresses </span>and <span style="font-weight:bold;">Virtual World Addresses.</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Adding in the Electronic World</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
There&#39;s a whole virtual world that&#39;s grown up around us, one you can&#39;t attach geospatial coordinates to: the realm of email addresses, IP addresses, and URLs.&nbsp; Should these be considered Locations as well?</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I think a reasonable case can be made to do so.</p>
<p dir="ltr" style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-weight:normal;font-style:normal;"><span style="font-size:11.0pt;"><span style="font-family:calibri;"><span style="font-size:11.0pt;"><span style="font-family:calibri;">1) Consider a customer:&nbsp; they have a physical address you can ship goods to; and they have an electronic address that you can send a shipment notice to.&nbsp; </span></span></span></span></span></p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-weight:normal;font-style:normal;"><span style="font-size:11.0pt;"><span style="font-family:calibri;"><span style="font-size:11.0pt;"><span style="font-family:calibri;">Both represent routes to the customer.</span></span></span></span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;" value="2">
<span style="font-weight:normal;font-style:normal;"><span style="font-size:11.0pt;"><span style="font-family:calibri;"><span style="font-size:11.0pt;"><span style="font-family:calibri;">Both are distinct from the customer and the actual data about the customer.</span></span></span></span></span></li>
</ul>
<p dir="ltr" style="margin-top: 0px; margin-bottom: 0px; vertical-align: middle;">
<span style="font-weight:normal;font-style:normal;"><span style="font-size:11.0pt;"><span style="font-family:calibri;"><span style="font-size:11.0pt;"><span style="font-family:calibri;">2) Consider their uniqueness:&nbsp; a given physical address (once you&#39;ve accounted for subdivisions such as apartments and floors) is a unique place; a given electronic address is also unique - while it might be accessed from many virtual points, my email address or the IP address I&#39;m currently connected to are distinct from any others.</span></span></span></span></span></p>
<p dir="ltr" style="margin-top: 0px; margin-bottom: 0px; vertical-align: middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">3) Consider that they are routing mechanisms: they serve as endpoints for their relevant protocols to deliver content to.</span></span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Generally, electronic addresses are less complex in content than physical addresses, but I think it worthwhile to treat this set as part of the larger variety of Location data.</p>
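<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As a small illustration of treating these as one family of Location data, the sketch below sorts a value into one of the three electronic address kinds mentioned above.&nbsp; The function name and the category labels are my own assumptions, and the email check is a deliberately loose heuristic rather than a full validation.</p>

```python
import ipaddress
from urllib.parse import urlparse


def classify_electronic_address(value):
    """Return 'ip', 'url', 'email', or 'unknown' for a candidate address."""
    text = value.strip()
    # IP addresses (v4 or v6): let the standard library decide
    try:
        ipaddress.ip_address(text)
        return "ip"
    except ValueError:
        pass
    # URLs: only web schemes are considered in this sketch
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Email: loose heuristic, one "@" with a dotted domain and no spaces
    local, at, domain = text.partition("@")
    if at and local and "." in domain and "@" not in domain and " " not in text:
        return "email"
    return "unknown"


for candidate in ["192.0.2.10", "https://example.com/orders/1",
                  "customer@example.com", "22 Main St."]:
    print(candidate, "->", classify_electronic_address(candidate))
```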
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">And what of the World of Voice?</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Phones are an interesting crossing point.&nbsp; Through most of their history they have been physical devices, whether in the form of landline phones in your home, fax machines in the office, or mobile phones in your pocket.&nbsp; At any given point in time, they occupy a physical space which can be described as a Physical Address or as a Geospatial Coordinate.&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
However, they also have a phone number which serves as a Voice Address -- I can call your phone number to reach you just as I can send a letter to your physical address or an email to your electronic one.&nbsp; Those phone numbers are unique at a given point in time and are clearly distinct from either the user of the phone or the premises at which the phone currently resides. These Voice Addresses are also distinct from the serial numbers that uniquely distinguish your iPhone from your friend&#39;s iPhone.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Now the distinctions start to blur when you consider Conference Numbers and VoIP - these are virtual addresses, numbers you can dial from some other phone or machine to reach someone else but are not necessarily at any specific physical location or linked to any specific device.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Emergence of Virtual Locations</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Given the blurring of boundaries between Electronic Address and Voice Address, it may be useful to consider these as Virtual Locations.&nbsp; And that in turn allows us to bring Virtual World Locations into a common picture as well.&nbsp; I see this aspect in the online games my kids play, where they decide which server or world they want to participate in at any given time.&nbsp; Like conference call numbers, these virtual locations have specific identities to select and often restrictions on the number of connections supported at a given point in time.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-style:italic;">Why do we want to include these electronic, voice, and virtual addresses in our consideration of Location data?</span>&nbsp; In many instances, these are the only Location data we have for a given customer, client, vendor, etc.&nbsp; Many of our interactions are solely electronic.&nbsp; Where a product is virtual, such as an eBook or a PDF document or even a tax form, delivery of goods and acknowledgment of receipt are based on these locations.&nbsp; Our understanding of customers and suppliers is increasingly shaped by our awareness of the Virtual World as well as the physical one.&nbsp; <span style="font-style:italic;">What do you think?&nbsp; Do you agree with this broad perspective on Location?</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
In my next post, I&#39;ll take up the increasing sources of Location data and their possible uses and interactions, as I think this is key to determining whether it is worthwhile to turn Location into Master Data, and if so, which pieces.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As always, the postings on this site are my own and don&#39;t necessarily represent IBM&#39;s positions, strategies or opinions.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-size:16px;"><span style="font-weight: bold;">Location, Location, Location - part 1, Variety</span></span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Posted 2013-10-24</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Henrik Liliendahl S&oslash;rensen&#39;s recent Blog entry <a href="http://liliendahl.com/2013/10/19/growing-variety-in-big-master-data/">Growing Variety in Big Master Data</a> got me thinking more about Location as Master Data.&nbsp; And rather than trying to fit a lot of thoughts into a single blog post, I&#39;ve divided up some thoughts around topics of Variety, Use, and Life cycle of Location data.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">The Many Varieties of Location</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
When someone says &quot;Location&quot;, our first instinct points to a <span style="font-weight:bold;">physical address</span>: a place where someone lives or works; or a place you can send mail to; or a place you can get directions to.&nbsp; There&#39;s standard information we associate with a physical address:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">A street:&nbsp; typically with a building number, a street name and street type, maybe a directional indicator, maybe a more specific unit or apartment number</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">A city:&nbsp; a larger geographical area containing the street</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">A state or province:&nbsp; another layer in the geographical hierarchy</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">A postal code: a smaller geographical area containing the street, usually for mailing purposes</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
This type of location information may be free-form, highly parsed, or somewhere in-between with varying degrees of accuracy.</p>
<div dir="ltr" style="clear:both;">
<div style="direction:ltr">
<table border="1" cellpadding="0" cellspacing="0" style="direction:ltr;
border-collapse:collapse;border-style:solid;border-color:#A3A3A3;border-width:
1pt" valign="top">
<tbody>
<tr>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.2888in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
<strong>Address 1</strong></p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.6125in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
<strong>Address 2</strong></p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
<strong>City</strong></p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
<strong>State</strong></p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.9909in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
<strong>Postal Code</strong></p>
</td>
</tr>
<tr>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.2888in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
22 Main St.</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.6125in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
Apt 1B</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
Salem</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
MA</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.9909in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
01970</p>
</td>
</tr>
<tr>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.2888in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
22 Main Street</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.6125in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
&nbsp;</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
Salem</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
MA</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.9909in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
01970</p>
</td>
</tr>
<tr>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.2888in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
22 N Main St</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.6125in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
&nbsp;</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
Salem</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
MA</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.9909in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
01970</p>
</td>
</tr>
<tr>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.2888in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
30 Witches Road</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.6125in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
#B</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
Salem</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
MA</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.9909in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
01970</p>
</td>
</tr>
<tr>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.2888in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
300 Witch Way</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.6125in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
&nbsp;</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
Salem</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
MA</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.9909in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
01970</p>
</td>
</tr>
<tr>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.2888in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
US Post Office</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:1.6125in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
2 Margin St, 20th floor</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
Salem</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.6673in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
MA</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;
vertical-align:top;width:.9909in;padding:4pt 4pt 4pt 4pt">
<p style="margin:0in;font-family:Calibri;font-size:11.0pt">
01970</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As we&#39;ve seen for a while, there is a range of information quality challenges to address here if we&#39;re thinking about Location Master Data, such as the standardization and verification of the data.</p>
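<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
To give a feel for the standardization challenge, here is a minimal sketch that normalizes the street lines from the table above.&nbsp; The abbreviation map is a small assumed sample, not an official postal standard, and the function name is hypothetical.&nbsp; Verification (confirming that the address actually exists) is a separate step that this sketch does not attempt.</p>

```python
import re

# Assumed sample of street-type abbreviations; a real list would be far longer
STREET_TYPES = {"STREET": "ST", "ROAD": "RD", "AVENUE": "AVE"}


def standardize_street(line):
    """Upper-case, drop periods and commas, and abbreviate known street types."""
    tokens = re.sub(r"[.,]", " ", line.upper()).split()
    return " ".join(STREET_TYPES.get(token, token) for token in tokens)


for raw in ["22 Main St.", "22 Main Street", "22 N Main St", "30 Witches Road"]:
    print(raw, "->", standardize_street(raw))
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
With this sketch, &quot;22 Main St.&quot; and &quot;22 Main Street&quot; standardize to the same value, while &quot;22 N Main St&quot; stays distinct.&nbsp; Whether that third row is a different place or the same place entered differently is exactly the kind of question verification has to answer.</p>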
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">A Place on the Globe</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
More frequently these days, &quot;Location&quot; may describe a <span style="font-weight:bold;">geospatial coordinate</span>:&nbsp; a reference to a specific point on the Earth, typically expressed as x, y, and z coordinates that represent longitude, latitude, and elevation.&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
For instance, doing a quick search on Google for coordinates for Salem MA, noted above, provides the following latitude and longitude: 42.5168&deg; N, 70.8985&deg; W.<a href="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/salemMA.gif" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/salemMA.gif" style=" display:block; margin: 1em 0pt 0pt 1em; float: right;" /></a>&nbsp; The Google search image at right is what you&#39;ll see from a satellite perspective.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
However, a slightly different search parameter gives me:&nbsp;</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Latitude: N 42&deg; 31&#39; 10.344&quot;</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Longitude: W 70&deg; 53&#39; 48.1758&quot;</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Another variation gives me:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Latitude: N 42&deg; 31.1724&#39;</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Longitude: W 70&deg; 53.80293&#39;</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
And yet another provides:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Latitude: 42.51954&deg;</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Longitude: -70.896716&deg;</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Depending on the source, you could get any of these, so again from a Master Data perspective there are some standardization aspects to address:&nbsp; should the value carry a hemisphere letter or just a positive/negative sign, and should it be measured in degrees/minutes/seconds or use a decimal division of degrees or minutes?</p>
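<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As a rough illustration of that standardization step, here is a minimal Python sketch that normalizes the notations quoted above to signed decimal degrees.&nbsp; The function name <code>to_decimal_degrees</code> and the choice of signed decimal degrees as the target format are assumptions made for illustration, not a recommendation from any particular tool.</p>

```python
import re


def to_decimal_degrees(text: str) -> float:
    """Convert one latitude or longitude value to signed decimal degrees.

    Handles the notations quoted in the post:
      N 42° 31' 10.344"   (degrees, minutes, seconds)
      N 42° 31.1724'      (degrees, decimal minutes)
      42.51954°           (decimal degrees; the sign carries the hemisphere)
    """
    # A hemisphere letter only counts when it stands alone (not inside a word).
    hemisphere = re.search(r"\b[NSEW]\b", text.upper())
    # Unsigned numbers in order: degrees, then optional minutes and seconds.
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers or len(numbers) > 3:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees, minutes, seconds = (numbers + [0.0, 0.0])[:3]
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative; so is a leading minus sign.
    negative = text.strip().startswith("-") or (
        hemisphere is not None and hemisphere.group() in "SW"
    )
    return -value if negative else value


if __name__ == "__main__":
    for reading in ['N 42° 31\' 10.344"', "N 42° 31.1724'", "42.51954°"]:
        print(reading, "->", round(to_decimal_degrees(reading), 5))
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Run against the values quoted above, the last three readings work out to the same point to roughly six decimal places; only the first search result (42.5168&deg; N, 70.8985&deg; W) is a genuinely different point.&nbsp; That is a useful reminder that a difference in format is not always a difference in data.</p>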
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
It&#39;s worthwhile to note too that a geospatial coordinate for a city, state, or country is likely to be an arbitrary point within that city, state, or country.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Location as a Shape or Group<a href="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/potomac.gif" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/potomac.gif" style=" display:block; margin: 1em 0pt 0pt 1em; float: right;" /></a></span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Not all locations are described by a postal address or single geospatial coordinate, though.&nbsp; Some of these are <span style="font-weight:bold;">descriptive locations</span>.&nbsp; Others are frequently described as<span style="font-weight:bold;"> shapes, shape files, </span>or<span style="font-weight:bold;"> polygons.&nbsp; </span>Consider some of the following:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Crime incidents may use a less specific address such as a block site like: </span></span>
<ul>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">3500 - 3599 BLOCK OF 19TH STREET SE</span></span></li>
</ul>
</li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Utility locations may use directional information useful for a human: </span></span>
<ul>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">GO 100 FEET NW FROM END OF KING ST</span></span></li>
</ul>
</li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Bus Routes or Bike Lanes require data that can describe an entire shape such as the Potomac Trail in Washington, DC shown at right&nbsp; [see: </span></span><a href="http://bikewashington.org/routes/potomac/potomac.htm">http://bikewashington.org/routes/potomac/potomac.htm</a><span style="font-size:11.0pt;"><span style="font-family:calibri;">] and include points that cross over each other.</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;"><span style="font-size:11.0pt;"><span style="font-family:calibri;">Administrative and statistical areas (e.g. Census Tracts, County Lines, Zip Codes, etc.) require data that can describe a boundary (another type of shape)</span></span></span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
And there are also <span style="font-weight:bold;">hierarchical groupings</span> that may consist of any of the above.&nbsp; Consider a Sales Region that comprises several Zip Codes that cross two states (e.g. Kansas City, MO and Kansas City, KS).&nbsp; In turn it falls under a larger Region such as Midwest US perhaps based on States instead of Zip Codes; which falls under Western US; and then North American Sales.</p>
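<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
The grouping described above can be sketched as a simple parent lookup.&nbsp; This is only an illustration: the region names come from the example, while the dictionary layout, the function name <code>rollup_path</code>, and the two sample Zip Codes are assumptions.</p>

```python
from typing import Dict, List

# Hypothetical parent links for the sales hierarchy in the example.
PARENT: Dict[str, str] = {
    "Kansas City Sales Region": "Midwest US",
    "Midwest US": "Western US",
    "Western US": "North American Sales",
}

# The lowest level groups Zip Codes from two states (sample values only),
# while higher levels could just as well be defined by whole states.
MEMBERS: Dict[str, List[str]] = {
    "Kansas City Sales Region": ["64101 (MO)", "66101 (KS)"],
}


def rollup_path(region: str) -> List[str]:
    """Return the region followed by each ancestor up to the top level."""
    path = [region]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


if __name__ == "__main__":
    print(" > ".join(rollup_path("Kansas City Sales Region")))
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Note that each level may be defined by a different kind of Location, which is exactly what makes these groupings awkward to standardize.</p>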
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Finding that perfect Location</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
With these variants in type of Location Data, <span style="font-style:italic;">which is the right one to store for Master Data?</span>&nbsp;</p>
<ul dir="ltr">
<li style="font-family: Calibri; font-size: 11pt;">
Should they all be standardized to one type?&nbsp; You can assign geospatial coordinates to addresses, and potentially to descriptive information if you can get it, so adding in that location detail may make sense for future use.&nbsp; That might suggest that Locations should be stored by geospatial coordinates.&nbsp;</li>
<li style="font-family: Calibri; font-size: 11pt;">
But a Route or a County is a set of geospatial coordinates, either a set of line segments or a set of boundary points -- should those be consolidated with addresses or kept distinct?&nbsp;</li>
<li style="font-family: Calibri; font-size: 11pt;">
And groupings such as Sales Districts may be primarily based on Locations which are Address-based (this set of streets, this set of states or countries) -- does it make sense to add geospatial boundaries to those?</li>
</ul>
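<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
To make the trade-off concrete, here is a minimal sketch of one possible option: keep each Location in the form it arrived in and attach geospatial points only where they are known.&nbsp; The <code>Location</code> class, its field names, and the <code>kind</code> values are all hypothetical.</p>

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude), signed decimal degrees


@dataclass
class Location:
    """Hypothetical master-data record: native description plus optional geometry."""
    kind: str         # e.g. "address", "descriptive", "route", "boundary", "group"
    description: str  # the value as it arrived from the source
    points: List[Coordinate] = field(default_factory=list)   # empty if not geocoded
    members: List["Location"] = field(default_factory=list)  # used by groupings

    def is_geocoded(self) -> bool:
        return bool(self.points)


def missing_geometry(group: Location) -> List[Location]:
    """Members of a grouping that still lack a geospatial reference."""
    return [m for m in group.members if not m.is_geocoded()]


# One point means a single coordinate; many points could trace a route or boundary.
salem = Location("place", "Salem, MA", points=[(42.51954, -70.896716)])
block = Location("descriptive", "3500 - 3599 BLOCK OF 19TH STREET SE")
region = Location("group", "Sample grouping", members=[salem, block])

if __name__ == "__main__":
    print([m.description for m in missing_geometry(region)])
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
A structure like this defers the decision rather than making it: nothing is lost from the source, and geospatial detail can be filled in later if a downstream use calls for it.</p>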
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<em>Perhaps all types should be stored, either distinctly or with geospatial references where available?</em>&nbsp; I think this depends heavily on the sources of incoming information and what you want to do with the information downstream.&nbsp; I&#39;ll look further at the increasing sources of Location data and possible uses in my next post, and in the meantime, I&#39;m interested to hear <strong>what other varieties of Location data you&#39;ve encountered</strong>.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As always, the postings on this site are my own and don&#39;t necessarily represent IBM&#39;s positions, strategies or opinions.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Big Data Lake or Big Data Landfill?</span> (posted by smithha, 2013-10-02)</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I&#39;m in the midst of moving to a new laptop.&nbsp; While the new laptop offers the promise of faster speed, more CPU, and more disk space, there&#39;s the usual challenge of getting everything configured and getting all my old files moved.&nbsp; And, unfortunately, I&#39;m also one of those people who saves a lot of stuff in a lot of files over the years.&nbsp;&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As with any move, whether physical or electronic,&nbsp; I&#39;m immediately faced by the question:&nbsp; <span style="font-weight:bold;">do I really need to bring this stuff along or can I finally get rid of it?</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
On the plus side, I tend to have everything categorized in a couple levels of folders that I can make sense of quickly.&nbsp; On the down side, that level of organizing means a lengthy process to review folders and determine what to keep or throw away.&nbsp; For instance, my general knowledge base folder contains 100 separate folders, each typically containing 5-30 files.&nbsp; Some of the detail folders are fairly static at this point while others such as my BigData folder are actively growing.&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I&#39;ve got some basic tools to help me make decisions to keep or discard.&nbsp; Generally, I recognize file names and rough level of content.&nbsp; I can use tools to assess when the files were created, updated, or last accessed.&nbsp; For something like my knowledge base, most likely I&#39;ll bring over the whole high-level folder just to make sure I don&#39;t lose anything I need.&nbsp; After all, there&#39;s still a reasonable size limit since the overall contents are bounded by the size of my current hard drive.</p>
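<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
The kind of check described above can be sketched in a few lines of Python.&nbsp; This is an illustration only: the function name <code>stale_files</code> and the one-year threshold are assumptions, and it uses the last-modified time because last-access times are not reliably recorded on every file system.</p>

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import List, Tuple


def stale_files(root: str, days: int = 365) -> List[Tuple[Path, datetime]]:
    """List files under root not modified in the last `days` days, oldest first."""
    cutoff = datetime.now() - timedelta(days=days)
    found = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            modified = datetime.fromtimestamp(path.stat().st_mtime)
            if modified < cutoff:
                found.append((path, modified))
    return sorted(found, key=lambda item: item[1])


if __name__ == "__main__":
    # Review candidates in the current folder, oldest first.
    for path, modified in stale_files("."):
        print(modified.date(), path)
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
A listing like this only tells you how old a file is, not how valuable it is, which is why the decision to keep or discard still comes down to knowing the content.</p>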
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Jumping into the Big Data Lake</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
So what does this small example have to do with Big Data?&nbsp; To me, it illustrates one of the key governance challenges facing Big Data.&nbsp; The concept of the Big Data Lake emerged around two years ago (see:&nbsp; <a href="http://www.forbes.com/sites/ciocentral/2011/07/21/big-data-requires-a-big-new-architecture/">Big Data Requires a Big, New Architecture</a>).&nbsp; In general, the Data Lake allows organizations to store &quot;the data in a massive, easily accessible repository based on the cheap storage that&rsquo;s available today. Then, when there are questions that need answers, that is the time to organize and sift through the chunks of data that will provide those answers.&quot;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Tools have developed to support a lot of ways to jump in and get at all that Big Data.&nbsp; IBM&#39;s InfoSphere Data Explorer is one example &quot;to help users of all kinds find and share information more easily and to help organizations launch big data initiatives more quickly&quot; (see: <a href="http://www-03.ibm.com/software/products/us/en/dataexplorer/">IBM InfoSphere Data Explorer</a>).</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Just as I can browse my own local laptop directory, the various Big Data tools allow us to search, find, tag, explore, and provision the data in the Big Data Lake.&nbsp; And we work with this Big Data with the goal of finding those really valuable diamonds -- information with which we can drive new business insight.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Pulling up Old Shoes</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
With a lot of people and processes adding data into the Big Data Lake, there&#39;s a lot of opportunity for that lake to turn into a Big Data Swamp or, perhaps worse, a Big Data Landfill!&nbsp; There will be a strong tendency to treat such Big Data Lakes as landing zones in which to put anything of potential use for subsequent analysis, and to perceive those landing zones as having unlimited storage capacity as well.&nbsp; Instead of working with a small set of known directories as on a laptop, you may be looking at hundreds or thousands of directories with untold numbers of files, often with cryptic names.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Consider the following partial directory listing from a test environment:</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<a href="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/BigDataContentDirectory.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/BigDataContentDirectory.png" style=" display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
In this case, there is very little information to go on unless you open each and every file (and maybe not even then depending on the data format), or hope that the tools you have available can give you more insight.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As the volume and variety of these files increase and their velocity or frequency of arrival also expands, users fall back on what they know and have personal confidence in.&nbsp; That at least increases the likelihood of pulling up something of interest rather than the flotsam and jetsam of the Big Data Lake.&nbsp; But it also may diminish the value of the Big Data.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Retain or Remove?</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
When working with files on my laptop, I have the advantage of knowing when they were created, what they contain, when they were last used, and most importantly how valuable they are to my work.&nbsp; With hundreds or thousands of users, great volumes of files created by users or by automated processes, and likely little understanding of who else is using a given file and why, there&#39;s an immediate challenge in managing and governing this Big Data Lake.&nbsp; Add the ever-changing nature of an organization, where users who add and understand content move to new roles or leave altogether, and there&#39;s also an increasing likelihood that a lot of data will exist that is, in effect, orphaned.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<em>One aspect of Information Governance in the Big Data context is how we manage <strong>the lifecycle of this data</strong>.&nbsp;</em> These are fundamentally policy questions supported by people and process, with tools as facilitators not dictators.&nbsp; Questions to address for this Big Data Lake include:&nbsp;</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">How long will the organization retain this data?</span></span>
<ul>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">If the data is used in making certain kinds of business decisions, are there policies that dictate this retention period?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">If part of the value in Big Data is finding unexpected trends over time, is there value in retaining some of this data to increase the likelihood of finding those trends?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Are there ways to readily categorize the data between what only has immediate, time-sensitive value and what has longer-term value?</span></span></li>
</ul>
</li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Will some of this data be moved to historical or archived locations?</span></span>
<ul>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">If so, will there be any different approach to finding, accessing, and utilizing this data?</span></span></li>
</ul>
</li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;"><span style="font-size:11.0pt;"><span style="font-family:calibri;">How will the data be disposed of?</span></span></span></span>
<ul>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">If the data contains content of particularly sensitive nature, are there policies that dictate the disposal practice?</span></span></li>
</ul>
</li>
</ul>
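<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Once an organization has answered questions like these, the answers can be captured as explicit rules rather than left to individual judgment.&nbsp; The sketch below is purely illustrative: the categories, the retention periods, and the names <code>RetentionRule</code> and <code>lifecycle_action</code> are placeholders, not values drawn from any real policy or product.</p>

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RetentionRule:
    """Hypothetical policy entry; categories and periods are placeholders."""
    archive_after_days: int
    dispose_after_days: int
    secure_disposal: bool = False


RULES = {
    "time-sensitive": RetentionRule(archive_after_days=30, dispose_after_days=90),
    "long-term": RetentionRule(archive_after_days=365, dispose_after_days=3650),
    "sensitive": RetentionRule(90, 365, secure_disposal=True),
}


def lifecycle_action(category: str, created: date, today: date) -> str:
    """Decide what the policy says to do with a data set of a given age."""
    rule = RULES.get(category)
    if rule is None:
        # Uncategorized or orphaned data needs a human decision.
        return "review"
    age_days = (today - created).days
    if age_days >= rule.dispose_after_days:
        return "secure-dispose" if rule.secure_disposal else "dispose"
    if age_days >= rule.archive_after_days:
        return "archive"
    return "retain"


if __name__ == "__main__":
    print(lifecycle_action("long-term", date(2013, 1, 1), date(2013, 10, 2)))
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
The code is the easy part.&nbsp; Deciding which category a data set belongs to, and who owns that decision, is where the people and process come in.</p>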
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
All of these questions raise considerations for an organization as part of its Information Governance program.&nbsp; Given my own current migration process, <strong>I&#39;m curious if your organization is addressing these aspects of Information Lifecycle Management in its Big Data context?</strong></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="margin:0in;font-family:Calibri;font-size:11.0pt">
As always, the postings on this site are my own and don&#39;t necessarily represent IBM&#39;s positions, strategies or opinions.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">The Right Stuff - Building the Right Data Science Team</span> (posted by smithha, 2013-09-03)</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I&#39;ve recently seen a lot of questions and discussions about the emerging role of the Data Scientist:&nbsp; who they are; what they do; how or why they are different from data analysts; and why they are critical to the success of Big Data in an organization.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
From some of the discussions it often seems like you cannot hope to get any value from Big Data unless you have not just a Data Scientist but a team of Data Scientists at work.&nbsp; There is a lot that the Data Scientist brings to the table:&nbsp; scientific methodology, higher-level statistical analysis, mathematical modeling.&nbsp; But I&#39;ve often had the feeling reading through these discussions that there were missing elements.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
So I was very pleased to see the article &quot;<a href="http://www.information-management.com/news/big-insights-from-big-data-require-the-right-data-science-team-10024806-1.html?ET=informationmgmt:e9392:2135510a:&amp;st=email">Big Insights from Big Data Require the Right Data Science Team</a>&quot; from Information Management discussing the insights of Booz Allen Hamilton&#39;s consulting team and pointing to their <a href="http://www.boozallen.com/media/file/DataScience_Infographic_Final.pdf">Data Science infographic</a>.&nbsp; The article notes:&nbsp; &quot;The right data science teams <span style="font-weight:bold;">blend</span> the technical expertise of computer scientists and mathematicians and statisticians with a critically-important, but overlooked, element&mdash;<span style="font-weight:bold;">domain knowledge</span>.&quot;&nbsp; This lines up with my experience and what I am seeing emerging in the industry.&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">The Data Scientist</span> brings in these new skill sets, which are particularly useful as you extend the range of the data in use beyond the <a href="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/iscp_icon_data_scientist_user.png" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/iscp_icon_data_scientist_user.png" style=" display:block; margin: 1em 1em 0pt 0pt; float: left;" /></a>organizational basics.&nbsp; They address the &quot;what&quot; in the equation -- what steps need to be taken, what models are applicable, what the results might indicate.&nbsp; They bring a scientific rigor to Big Data.&nbsp; However, as the article notes, they are not the only role involved.&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">The Information Architect</span>, who could range from a data analyst in some organizations to a data integration specialist to the computer scientist noted in the Booz Allen Hamilton infographic, represents the skill set to make Big Data happen.&nbsp; They address the &quot;how&quot; in the equation.&nbsp;&nbsp; Where you have diverse sets of data with different structures (relational,&nbsp; unstructured, semi-structured), different data types and formats, or even different timing intervals, you need skills to figure out how to put the data together in a meaningful and useful way.&nbsp; This person can understand data models and pull data from traditional sources, work with Hadoop, utilize ETL tools, and put reports together.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/iscp_icon_data_analyst_user.png" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/iscp_icon_data_analyst_user.png" style=" display:block; margin: 1em 1em 0pt 0pt; float: left;" /></a></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
And <span style="font-weight:bold;">the Domain Expert</span>, whether a business analyst or a subject matter expert or the data steward, brings in the business insight to help identify areas of impact, raise considerations about the business and the data, and provide a check and validation on the conclusions drawn from the results.&nbsp; These experts have seen the data used in the business processes and they know their industry.&nbsp; They understand when data looks &#39;right&#39; and when it does not.&nbsp;</p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
The ultimate value, though, is in the blend of the skill sets.&nbsp; The infographic comments that &quot;the ability to fuse disparate, seemingly unrelated data&mdash; like financial transaction information, payment records and exchange rates&mdash; can produce an entirely new level of insight and direction.&quot;&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
This is the value proposition that Big Data can enable, but as with most initiatives it comes back to the <strong>old equation of people, process, and technology</strong> with the team of people providing the right Data Science stuff.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As always, the postings on this site are my own and don&#39;t necessarily represent IBM&#39;s positions, strategies or opinions.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">BigData, Governance, and Emerging Data Warehouse Demands</span> (posted by smithha, 2013-08-20)</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
After a bit of a break for vacation, it&#39;s a good point to catch up on some items from the past month.&nbsp; One item I thought I&#39;d call out was the recent release of a new IBM Redbook <a href="http://www.redbooks.ibm.com/abstracts/sg248126.html?Open"><span style="text-decoration:underline;">IBM Information Server: Integration and Governance for Emerging Data Warehouse Demands</span></a> which I helped to write.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I&#39;ve commented recently in this blog on trends in Big Data and some of the associated aspects of Information Governance.&nbsp; Both of these trends are impacting the way we traditionally look at and work with data warehouses, those centerpieces of many organizations&#39; enterprise information architecture.&nbsp; The demands we&#39;ve seen organizations wrestling with include:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Demands for more and faster access to data to quickly accommodate changing business requirements</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Demands to incorporate and integrate more types of data at greater volumes and faster speeds than ever before</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Demands to incorporate deeper analytical capabilities into the warehouse to predict customer churn, improve segmentation for marketing, etc.</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Demands to improve the governance and raise the confidence of users in the breadth and quality of data stored in the warehouse</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
In the Redbook, we talk about some of the recent additions to IBM&#39;s Information Server product line that help to meet these emerging challenges.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
For instance, IBM InfoSphere Data Click is designed to help a business user perform self-service operations to select and load data from a data warehouse to a data mart without requiring experience in designing a target model.&nbsp; At the same time, there are governance and quality requirements around the data to ensure that only certain data can be accessed and copied and that the right quality of data is delivered.&nbsp; These aspects are built into the InfoSphere Data Click design.&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
For the business user, the result is a two-click experience: selecting a prebuilt blueprint and then offloading the data to an environment where they can build and run the reports they need.&nbsp; For the IT staff and the data stewards, it&#39;s a configuration-based approach that gives business users the right tools for easy access without requiring the creation of complex scripts or database access, since InfoSphere Data Click takes full advantage of the IBM Information Server processing and metadata functionality.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
To address the range of Big Data sources coming into a data warehouse (or to offload warehouse data to a Hadoop platform), IBM InfoSphere DataStage incorporates:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Usage of a Big Data File Stage to load data to or extract data from Hadoop systems</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Capability to push down processing from an ETL flow design into Hadoop, taking advantage of the native processing power there</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Integration with IBM InfoSphere Streams to support real-time, low-latency analytics processing</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
These additions allow for broad integration between Big Data and the traditional warehouse data.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
From the governance perspective, IBM Information Server now supports information governance policies and rules within its business glossary, allowing data stewards to connect more of the information landscape together and tie it into the governance requirements of the organization.&nbsp;&nbsp; These capabilities naturally support the needs and questions of an information governance organization such as:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">What policies do we need to address?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">What governance rules are incorporated in the policy?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">What assets or data are governed by the policy?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">What quality validations are needed to enforce a governance rule?</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
By incorporating this type of information within a business glossary, users gain broader visibility into the overall governance requirements.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
If you&#39;re looking at any of these aspects of integration with or governance over your data warehouse, have a look into some of the new capabilities we note in the Redbook.&nbsp;&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As always, the postings on this site are my own and don&#39;t necessarily represent IBM&#39;s positions, strategies or opinions.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Biting into Big Data</span> (posted 2013-07-22)</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I regularly check out the articles from FastCompany&#39;s twin sites Co.DESIGN <a href="http://www.fastcodesign.com/">http://www.fastcodesign.com/</a> and Co.LABS <a href="http://www.fastcolabs.com/">http://www.fastcolabs.com/</a>.&nbsp; I really enjoy their mix of informative and eclectic articles, and the former particularly incorporates some very interesting and intriguing infographics.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As I caught up from my recent vacation, one recent infographic that caught my eye was called &quot;The United States of Burgers&quot; (<a href="http://www.fastcodesign.com/1673006/infographic-the-united-states-of-burgers#1">http://www.fastcodesign.com/1673006/infographic-the-united-states-of-burgers#1</a>), a somewhat whimsical look at the most popular fast food burger joints by city put together by PeekAnalytics (<a href="http://www.peekanalytics.com/burgerjoints/">http://www.peekanalytics.com/burgerjoints/</a>).&nbsp; They note on their site, &quot;For the past month, PeekAnalytics tracked millions of Tweets of fast food burger chains. This map shows which restaurant was the most popular in over 12,000 cities across the USA.&quot;&nbsp; You can look at a nation awash in burger joint logos, cull it back to ones of interest (and quickly see the dominance of golden arches, Burger Kings, and Wendy&#39;s across the country), zoom into particular states or even distinct cities.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">What&#39;s in the Graphic?</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
For those of us looking at Big Data and quality of information, there are some useful insights to gain from even this fun little graphic.&nbsp; After all, this is the crux of social media feeds -- culling out data that you can pair with your own internal data such as products and product sales.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
First, consider what we do know from the statement above and the infographic itself:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">The data source is Twitter </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">The data covers a one-month time period -- stated as the past month (probably June 2013 since the article appeared in July 2013)</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">The data covers over 12,000 cities in the US </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">There were millions of Tweets included</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">The Tweets had some mention of 26 named brands of fast food burger joints.</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
We can look at the graphic at various levels and potentially ascertain various facts:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">There are more McDonald&#39;s logos than Sonic logos</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Krystal has clusters of popularity in North Carolina and Georgia </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">La Crosse, WI prefers Burger King, but Eau Claire, WI prefers Whataburger and River Falls, WI prefers Hardee&#39;s </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">If I&#39;m travelling east on the interstate across central New Mexico, I&#39;m not going to find much choice unless I take a left turn at Albuquerque</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
But consider what we don&#39;t inherently know:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">What were the collection criteria?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Who made the tweets?&nbsp; </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">How were the brands identified? </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">How was popularity measured? </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">How was geographic location assessed?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">What&#39;s really reflected in the marker for a city?</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Is the data &#39;nutritionally&#39; valuable?</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As a data scientist or analyst, it&#39;s necessary to dig into what we don&#39;t know.&nbsp; Some of the important questions reflect basic aspects of data governance and quality; others reflect evaluation of the analysis in a larger context.&nbsp; Just thinking about the former, I could consider the following:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Is there a bias in the collection method?&nbsp; These tweets are made by people who have Twitter accounts and like to express either where they have been or an opinion about the burger joint.&nbsp; Or does the collector of the data have an interest in the data?&nbsp; While the data may be usable, it may not have sufficient quality to give realistic insight. </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Was all relevant data collected?&nbsp; Suppose we forgot to include critical hashtags?&nbsp; Maybe a common reference to McDonald&#39;s is #MickeyD&#39;s and failure to include it significantly skews the results.&nbsp; How would we understand and record a level of completeness in the data? </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Was context of the tweet captured and reflected?&nbsp; Did we discriminate between positive and negative comments or does that even matter?&nbsp; Should we record and measure something of this dimension (and what would we call it if we did)? </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Was geography based on the Twitter user, the place of the tweet, or the place of business?&nbsp; Can we even tell?&nbsp; Potentially we could match the geographic coordinates to our known brand locations in this case to capture some distance measure.&nbsp; Perhaps the consistency of geographic coordinates to business location would help ensure better quality? </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Should a distinction be made between cities where there is a clear preponderance of tweets for one brand vs. those where there are statistically insignificant variances between the top brands?&nbsp; Or do absences of certain brands reflect the lack of those brands in the city?&nbsp; At this point, there is a fine line between what may reflect a quality of data dimension (a measure of statistical significance) and an analytical or business dimension.</span></span></li>
</ul>
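<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
The geographic question above can be made concrete with a small sketch.&nbsp; Assuming each tweet carries coordinates and a brand mention, and that we hold a list of known brand locations, a distance measure shows how consistent a tweet&#39;s position is with an actual place of business.&nbsp; The record layout, brand names, and coordinates below are all hypothetical and only illustrate the idea; this is not how the infographic was built.</p>

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def nearest_store_km(tweet, stores):
    """Distance from a tweet to the closest known location of the brand it mentions.

    Returns None when we know of no location for that brand at all.
    """
    candidates = [s for s in stores if s["brand"] == tweet["brand"]]
    if not candidates:
        return None
    return min(
        haversine_km(tweet["lat"], tweet["lon"], s["lat"], s["lon"])
        for s in candidates
    )

# Hypothetical reference data: known brand locations.
stores = [
    {"brand": "BrandA", "lat": 44.86, "lon": -92.62},
    {"brand": "BrandB", "lat": 44.95, "lon": -92.37},
]

# Hypothetical tweets that mention a brand and carry coordinates.
tweets = [
    {"brand": "BrandA", "lat": 44.861, "lon": -92.621},  # next door to a BrandA location
    {"brand": "BrandB", "lat": 44.86, "lon": -92.62},    # some distance from any BrandB
    {"brand": "BrandC", "lat": 44.86, "lon": -92.62},    # brand with no known location
]

for t in tweets:
    distance = nearest_store_km(t, stores)
    label = "no known location" if distance is None else f"{distance:.1f} km"
    print(t["brand"], label)
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
A large distance, or no known location at all, does not make the tweet wrong, but it is a useful flag when deciding how much weight a city marker deserves.</p>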
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">A dash of local knowledge</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Local knowledge can help considerably in looking at the data.&nbsp; I noted earlier that the most commonly referenced burger place in River Falls, WI was Hardee&#39;s.&nbsp; I know River Falls well - it&#39;s where I grew up and still visit periodically.&nbsp; With a population of 15,000+ and a university, it&#39;s now the largest suburb of the greater St. Paul/Minneapolis region.&nbsp; The city currently has three fast food burger places:&nbsp; McDonald&#39;s and Burger King on the north edge of the city heading towards the interstate, and a Dairy Queen near the university.&nbsp; The closest Hardee&#39;s is in Baldwin, WI, some 20 miles away (and there is another in Black River Falls, WI, some 115 miles away).&nbsp; If I remember correctly, there was once a Hardee&#39;s in River Falls near the university, but that was some years ago.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
What other questions can we ask based on this local knowledge?</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Why is Hardee&#39;s the most referenced burger place when there isn&#39;t one there?</span></span>
<ul>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Does it reflect a comparison or preference for what is not there such as wishful thinking or nostalgia?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Does it reflect a proposal to bring a Hardee&#39;s franchise into the city?</span></span></li>
</ul>
</li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Is there something else in the tweets that we should correlate against to determine usefulness or value?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Were the tweets inappropriately linked to River Falls, WI when they were actually for Black River Falls, WI (such as through a failure or error in geospatial or matching logic)? </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Should a filter or correlation of actual burger franchises have been applied against the data?&nbsp; Or is it valuable to see the range of references regardless of whether a burger joint actually exists in the city?</span></span></li>
</ul>
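<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
To illustrate how a River Falls / Black River Falls mix-up could arise, here is a minimal sketch of two matching approaches.&nbsp; It is purely an assumption about one possible failure mode; I have no knowledge of the matching logic actually used for the infographic, and the function names are made up for the example.</p>

```python
def naive_city_match(place_text, city):
    """Substring match: treats any place text that contains the city name as a hit."""
    return city.lower() in place_text.lower()

def exact_city_match(place_text, city):
    """Exact match on the whole place name, ignoring case and surrounding spaces."""
    return place_text.strip().lower() == city.strip().lower()

place = "Black River Falls, WI"
target = "River Falls, WI"

# The substring approach links the tweet to the wrong city.
print(naive_city_match(place, target))   # True
# The exact approach keeps the two cities apart.
print(exact_city_match(place, target))   # False
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Exact matching has its own cost, since it will miss legitimate variants of a place name, so the choice is itself a data quality decision worth recording.</p>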
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">The detail or the aggregate?</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
If we&#39;re gathering this information on a regular basis for ongoing analysis, it may be as important to look at data quality from an aggregate perspective as from a detail perspective.&nbsp; Yes, we can measure whether the individual tweet has relevant geographic coordinates and usable hashtags and some set of useful text expressions, but the aggregate may be more meaningful with social media data if it meets the right parameters and fits our needs.</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">How many records did we receive this month? </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Were there shifts in geography for the month? </span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Were there shifts in positive or negative views for the month?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Were there shifts in references for one burger franchise vs. another?</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Once we&#39;ve identified that a given dataset has sufficient &#39;nutritive&#39; value for our organization and added some local knowledge as a useful check, these aggregate measures can help indicate shifts in content that could impact how and how well we can utilize the information over time.</p>
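<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As a rough sketch of such aggregate measures, the following compares record volume and brand share between two monthly batches.&nbsp; The record layout and brand names are hypothetical, and what counts as a significant shift is left to the analyst.</p>

```python
from collections import Counter

def brand_share(records):
    """Fraction of records that mention each brand (empty dict for an empty batch)."""
    counts = Counter(r["brand"] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {brand: count / total for brand, count in counts.items()}

def share_shift(previous, current):
    """Change in share per brand between two batches (current minus previous)."""
    brands = set(previous) | set(current)
    return {b: current.get(b, 0.0) - previous.get(b, 0.0) for b in brands}

# Hypothetical monthly batches of brand mentions.
june = [{"brand": "BrandA"}] * 6 + [{"brand": "BrandB"}] * 4
july = [{"brand": "BrandA"}] * 3 + [{"brand": "BrandB"}] * 6 + [{"brand": "BrandC"}] * 3

print("record count change:", len(july) - len(june))
for brand, delta in sorted(share_shift(brand_share(june), brand_share(july)).items()):
    print(f"{brand}: {delta:+.2f}")
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
The same pattern applies to shifts by geography or by sentiment: group, compute the share, and compare against the prior period.</p>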
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As always, the postings on this site are my own and don&#39;t necessarily represent IBM&#39;s positions, strategies or opinions.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Sensing Big Data: Information Quality for Sensor-based data</span> (posted 2013-06-28)</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I noted previously that I&#39;m working with a team on a new <a href="http://www.redbooks.ibm.com/Redbooks.nsf/pages/about?Open">IBM&nbsp;Redbook</a> initiative around Big Data Governance.&nbsp; We&#39;re delving into <a href="http://www-01.ibm.com/software/data/bigdata/use-cases.html">the 5 game changing big data use cases</a> and the governance implications for each of them.&nbsp;&nbsp; I&#39;ve always been a big proponent of the axiom &quot;Know your Data&quot; and to that end I&#39;ve been looking at some of the distinct types of data in the Big Data Information Landscape to cut through the mystery of what information quality may mean in this new context.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">An Internet of Things, or Sensors, Sensors, and more Sensors!</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
There&#39;s nothing like <a href="http://en.wikipedia.org/wiki/Internet_of_Things">an Internet of Things</a> to help drive Big Data.&nbsp; It seems practically any mobile device can become a sensor these days, not to mention the range of RFID tags, machine sensors for weather, water, traffic, etc.&nbsp; An iPhone 4 includes eight distinct sensors such as an accelerometer, a GPS, a compass, and a gyroscope.&nbsp; And these types of sensors are driving new initiatives such as <a href="http://www.ibm.com/smarterplanet/us/en/smarter_cities/infrastructure/index.html">Smarter Cities</a>.&nbsp; A good example of such use is <a href="http://sfpark.org/how-it-works/the-sensors/">SFpark</a> helping drivers find parking spaces in San Francisco through 8200 parking sensors.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">But what&#39;s in this Sensor data?</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
From a data quality or governance perspective, there&#39;s obviously a large range of possible data generated, but I was curious to see what some examples actually look like.&nbsp; I started browsing something publicly available, specifically data from the <a href="http://w1.weather.gov/xml/current_obs/">National Weather Service</a>.&nbsp; The data is generated at hourly intervals by ~1800 tracking stations.&nbsp; While it is feasible to look at some raw text data, there are two primary forms of data available:&nbsp; RSS and XML (and the RSS is just more truncated XML).&nbsp; You can get individual station data or zip files of all the data for a given time period.&nbsp; Overall, it makes for a nice starting point in getting to &quot;Know your Data&quot;!</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Just thinking about the weather</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I grabbed some zip files of both XML and RSS for three days at a couple of time intervals and extracted the files.&nbsp; I found 4165, 4169, and 4171 files for the three dates respectively, each named in the format XXXX.xml or XXXX.rss.&nbsp; Just at this level, I had some immediate thoughts on information quality measures:&nbsp;</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Did I pull the right file type?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Do the contents match the stated file type?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">And given the lack of date in the file name, is this data I&#39;ve already picked up?</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Nothing unusual at this level -- if anything, it&#39;s business as usual.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Opening up the XML file for station KSTP (St. Paul, MN -- station call letters I was very familiar with growing up), the file is run-of-the-mill XML.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<a href="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/NWS_KSTP-xml_2013-06-27-detail.gif" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/haraldsmith/resource/BLOGS_UPLOADED_IMAGES/NWS_KSTP-xml_2013-06-27-detail.gif" style=" width:400px; display:block; margin: 0 auto;text-align: center;" /></a></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
There&#39;s a Location, a Station ID, Latitude/Longitude, Observation Time, Temperature, and various Wind measurements.&nbsp; All nice structured content which means I could check for:&nbsp; <strong>Completeness</strong> (does the data exist), <strong>Format</strong> (does the data conform to expected structure), <strong>Validity</strong> (is it in the right value set or range).&nbsp; Checking out a subsequent day&#39;s record, I found some variation in the fields provided.&nbsp; As is typical for XML, certain fields can be included or omitted, so additional checks could be made for <strong>Consistency vs. the XML schema</strong> or <strong>Consistency over time intervals</strong>.</p>
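<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Here is a minimal sketch of what Completeness, Format, and Validity checks might look like for one observation file.&nbsp; The element names, required fields, and value ranges are my assumptions for illustration and should be checked against the actual feed and its schema; the sample record is made up.</p>

```python
import xml.etree.ElementTree as ET

# Assumed element names and plausible ranges; adjust to the real schema.
REQUIRED_FIELDS = ["station_id", "latitude", "longitude", "observation_time", "temp_f"]
NUMERIC_RANGES = {
    "latitude": (-90.0, 90.0),
    "longitude": (-180.0, 180.0),
    "temp_f": (-80.0, 130.0),
}

def check_observation(xml_text):
    """Return a list of (dimension, field) issues found in one observation record."""
    root = ET.fromstring(xml_text)
    issues = []
    for name in REQUIRED_FIELDS:
        node = root.find(name)
        # Completeness: the element must exist and carry a non-blank value.
        if node is None or not (node.text or "").strip():
            issues.append(("completeness", name))
            continue
        if name in NUMERIC_RANGES:
            # Format: numeric fields must parse as numbers.
            try:
                value = float(node.text)
            except ValueError:
                issues.append(("format", name))
                continue
            # Validity: the number must fall inside the expected range.
            low, high = NUMERIC_RANGES[name]
            if not low <= value <= high:
                issues.append(("validity", name))
    return issues

# Made-up sample record with no observation time.
SAMPLE = """<current_observation>
  <location>St. Paul, MN</location>
  <station_id>KSTP</station_id>
  <latitude>44.93</latitude>
  <longitude>-93.05</longitude>
  <temp_f>84.0</temp_f>
</current_observation>"""

print(check_observation(SAMPLE))
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Running the same checks over every file in a batch and counting issues by dimension gives a simple quality profile for that delivery.</p>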
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Not occurring in these samples, but certainly feasible for sensor data, is the possibility of sensor diagnostic or error codes.&nbsp; For instance, a temperature value of -200.0 could be an indicator that the sensor had an error condition and used that available field to pass on the diagnostic.&nbsp; Depending on whether the sensor is external or internal, this may be an item to note as Incomplete or may be an item to trigger some notification/alert process.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Have you ever seen the rain?</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
It&#39;s quite possible that an individual station reports weather that appears Complete, correctly Formatted, Valid, and Consistent and still has quality issues.&nbsp; Some additional factors to consider:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Are there data points for all intervals or expected intervals?&nbsp; This is a measure of <strong>Continuity for the data</strong> and can be applied to both individual sensors and groups of sensors.</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Is there <strong>Consistency of data across proximate data points</strong>?&nbsp; If St. Paul, MN and Bloomington, MN both show temperatures at 84.0 F, but Minneapolis, MN shows a temperature of 34.0 F, the latter is probably an error as you would not expect so sharp a temperature variation within such close proximity.</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Is there <strong>repetition/duplication</strong> of data <strong>across multiple recording intervals</strong>?&nbsp; There could certainly be the same data from a given sensor over multiple time periods, but is there a point at which these become suspicious and suggest an issue with the sensor?</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">Is there <strong>repetition/duplication of data across multiple sensors</strong>?&nbsp; There could be the same temperature, humidity, and wind for St. Paul, MN and Minneapolis, MN, but do you expect the exact same measurements between the two hour after hour?&nbsp; The samples I looked at certainly indicate some marginal variation consistent with different recording points.</span></span></li>
</ul>
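<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
The factors above lend themselves to simple checks.&nbsp; The sketch below covers continuity, consistency across nearby stations, and repetition over time.&nbsp; The thresholds (a 20-degree difference from the group median, for example) and the data layouts are assumptions chosen for illustration, not recommended values.</p>

```python
from statistics import median

def missing_hours(readings_by_hour, expected_hours=range(24)):
    """Continuity: expected hours for which a sensor delivered no reading."""
    return sorted(set(expected_hours) - set(readings_by_hour))

def neighbor_outliers(temps_by_station, max_diff=20.0):
    """Consistency across proximate stations: stations far from the group median."""
    mid = median(temps_by_station.values())
    return sorted(
        station
        for station, temp in temps_by_station.items()
        if abs(temp - mid) > max_diff
    )

def longest_repeat(values):
    """Repetition: length of the longest run of identical consecutive readings."""
    longest = run = 0
    previous = object()  # sentinel that never equals a real reading
    for value in values:
        run = run + 1 if value == previous else 1
        longest = max(longest, run)
        previous = value
    return longest

# Hypothetical readings for the same hour at three nearby stations.
nearby = {"St. Paul": 84.0, "Bloomington": 84.0, "Minneapolis": 34.0}
print(neighbor_outliers(nearby))            # ['Minneapolis']

# Hypothetical hourly readings from one sensor, with hour 2 missing.
hourly = {0: 70.0, 1: 70.0, 3: 70.0, 4: 71.0}
print(missing_hours(hourly, range(5)))      # [2]
print(longest_repeat(hourly.values()))      # 3
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
How long a run of identical readings has to be before it becomes suspicious depends on the sensor and the measure, so that limit is best set per data source.</p>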
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Given the volume of data points and the velocity or frequency of delivery, these may be as important as measures for Completeness or Validity if they are critical to analytic use.&nbsp; All of these can be monitored and followed over time as well, giving additional insight into trends of information quality.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">The answer is blowin&#39; in the wind</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
With some understanding of the data content and potential points of data quality failure, I come back to the value or <strong>fitness for purpose</strong> of the data.&nbsp;&nbsp; If I&#39;m evaluating the impact of the weather on my store-based sales vs. my online sales, I may want to correlate the hourly weather readings of stations close to my stores and close to my customers&#39; billing addresses.&nbsp; Hourly gaps may impact this analysis, but I may be able to smooth over such gaps with other nearby sensor readings.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
If I&#39;m evaluating daily sales only leading up to Christmas, I may only care about the aggregate weather for the day such as Min/Max Temperature and Total Precipitation.&nbsp; Two or three out of 24 possible data points may be quite sufficient for my needs, and the impact of specific data quality issues from a given sensor drops with an increase in available data points for the time period or the general area.&nbsp; And conversely, if I only have one sensor with very sporadic data near a given store or customer, the impact of data quality issues grows significantly.&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
This suggests that the <em>weight of given measures for data quality is not constant for sensor data</em>, but is variable depending on factors in its use.&nbsp; And one additional quality measure may be an identification of the fit of the sensor data coverage/measures to the data I wish to analyze against it (i.e. if I&#39;m evaluating a store in an area where no sensors exist, I&#39;ve got nothing to evaluate against).&nbsp;</p>
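<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
One way to picture this variable weighting is a daily summary that takes the minimum number of data points as a parameter of the intended use.&nbsp; The function name, thresholds, and readings below are hypothetical.</p>

```python
def daily_summary(temps_by_hour, min_points):
    """Summarize one day of hourly temperatures for a given use.

    Returns None when there are fewer readings than this use requires.
    """
    if len(temps_by_hour) < min_points:
        return None
    temps = list(temps_by_hour.values())
    return {
        "min": min(temps),
        "max": max(temps),
        "coverage": len(temps) / 24,  # share of the 24 hourly slots that are filled
    }

# Hypothetical sparse day: only three of 24 hourly readings arrived.
sparse_day = {6: 58.0, 14: 81.0, 22: 66.0}

# A daily min/max analysis may accept three points.
print(daily_summary(sparse_day, min_points=3))
# An hour-by-hour correlation would reject the same data.
print(daily_summary(sparse_day, min_points=20))
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
The data is identical in both calls; only the requirement attached to the use changes the outcome.</p>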
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">What else from sensors?</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
The Internet of Things, the instrumentation of many, many devices, will have a profound impact on the variety, volume, and velocity of incoming data to evaluate.&nbsp; Certainly this is just one example of the type of information available from sensors.&nbsp; However, in stepping through familiar data such as weather observations, not only do our common information quality measures hold up, but there are additional measures that can be put in place for ongoing monitoring.&nbsp; What becomes interesting is how the aggregation of such data may shift the quality requirements and the associated impact.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Do you have other examples?&nbsp; I&#39;m curious how well these information quality measures hold with the range of sensor-based data that is emerging.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As always, the postings on this site are my own and don&#39;t necessarily represent IBM&#39;s positions, strategies or opinions.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Who&#39;s Afraid of the Big Bad Data?</span> (posted 2013-06-19)</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
One of my colleagues sent me a link to the <a href="http://www.amazon.com/Bad-Data-Handbook-ebook/dp/B00A3IGAIA/ref=cm_sw_em_r_ask_WvtjF.1BE23GP_tt">Bad Data Handbook</a> by Q. Ethan McCallum.&nbsp; I will state clearly upfront that I have NOT read the book; however, given my long history with Information Quality products and solutions, I certainly found the title <strong>intriguing</strong> and, of course, <strong>provocative</strong>!&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I immediately had images of saying: <em>&quot;Bad datum!&nbsp; Bad datum!&nbsp; Off to your room this instant and don&#39;t come out until I call you for dinner!&quot;</em>&nbsp;<img alt="Angry" height="18" src="https://www.ibm.com/developerworks/community/connections/resources/web/com.ibm.oneui.ckeditor/editor/plugins/sametimeemoticons/images/EmoticonAngry.gif" title="Angry" width="18" /></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">But is there really such a thing as &quot;Bad Data&quot;?</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
The summary of the book notes:&nbsp; &quot;Bad data is <span style="font-style:italic;">data that gets in the way</span>&hellip;.&quot;&nbsp; My first response is:&nbsp; <em>&quot;Gets in the way of what?&quot;</em>&nbsp; Reflecting back on my last post and the notions of &quot;Know your data&quot; and &quot;Fit for purpose&quot;, the idea of good data or bad data really comes back to the context in which the data is placed and used.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Consider the following piece of data:</p>
<table border="1" cellpadding="0" cellspacing="0" dir="ltr" height="33" style="border-collapse: collapse; border-style: solid; border-color: rgb(163, 163, 163); border-width: 1pt; direction: ltr; vertical-align: top;" width="227">
<tbody>
<tr>
<td style="border-style: solid; border-color: rgb(163, 163, 163); border-width: 1pt; vertical-align: middle; width: 93px; padding: 4pt; text-align: left;">
<p style="font-family:Courier New;font-size:11.0pt;">
JOHN_DOE</p>
</td>
<td style="border-style:solid;border-color:#A3A3A3;border-width:1pt;vertical-align:top;width:110px;padding:4pt 4pt 4pt 4pt;">
<p style="font-family:Calibri;font-size:11.0pt;">
FridayThe13th</p>
</td>
</tr>
</tbody>
</table>
<div dir="ltr" style="clear:both;">
&nbsp;</div>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Is it good?&nbsp; Is it bad?&nbsp; Do I or can I even have an opinion on it?&nbsp; Not without establishing some context, some criteria of fitness, and an ability to assess or understand it against the context and criteria.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Putting data in context</span></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
If I told you this was all or part of a tweet, I&#39;ve given you some additional understanding about the data, but not provided any additional context or criteria to say whether it is good, bad, or otherwise.&nbsp; I add a context:&nbsp; I&#39;m collecting tweets to ascertain when my customers are most or least likely to shop for certain goods so I can improve my marketing campaign.&nbsp; Well, with that I can look at the data and say it has a name and a day of the month.&nbsp; Good so far?&nbsp; I still can&#39;t say as there are no criteria to judge it on.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I add some criteria:&nbsp; the data must contain names of customers and a positive or negative sentiment about the day with regard to shopping.&nbsp; Let&#39;s assume that the name does match a name in our customer master data system.&nbsp; But there&#39;s no statement of sentiment, just the day of the month.&nbsp; At that point I can say that the data does not meet my criteria for my context -- it is not &quot;Fit for purpose&quot;, and I can conclude the data is &quot;bad&quot; in that context.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
If I change the context:&nbsp; I&#39;m following tweets by my friends indicating their available days to see a movie.&nbsp; My criteria change along with the context:&nbsp; the name matches the name of a friend and the date given appears sufficient for my context.&nbsp; With this change in context and criteria, I conclude the data is &quot;good&quot;.</p>
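<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
The two contexts above can be sketched in a few lines of Python.&nbsp; This is only an illustration, not a real quality tool:&nbsp; the customer and friend lists, the sentiment words, and the names <span style="font-family:courier new;">CONTEXTS</span> and <span style="font-family:courier new;">unmet_criteria</span> are all assumptions made up for the example.&nbsp; The point is simply that the record never changes, only the criteria applied to it.</p>

```python
from typing import Callable, Dict, List, Tuple

# The record from the table above, split into its two cells.
record = {"name": "JOHN_DOE", "text": "FridayThe13th"}

# Hypothetical reference data standing in for real systems.
CUSTOMER_MASTER = {"JOHN_DOE", "JANE_ROE"}
FRIENDS = {"JOHN_DOE"}
SENTIMENT_WORDS = ("love", "hate", "great", "awful")
DAY_NAMES = ("monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday")

Criterion = Tuple[str, Callable[[dict], bool]]


def contains_any(text: str, words) -> bool:
    """Return True if any of the words appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(word in lowered for word in words)


# Each context carries its own list of (label, check) criteria.
CONTEXTS: Dict[str, List[Criterion]] = {
    "marketing campaign": [
        ("name is a known customer", lambda r: r["name"] in CUSTOMER_MASTER),
        ("states a sentiment", lambda r: contains_any(r["text"], SENTIMENT_WORDS)),
    ],
    "movie night": [
        ("name is a friend", lambda r: r["name"] in FRIENDS),
        ("gives a day", lambda r: contains_any(r["text"], DAY_NAMES)),
    ],
}


def unmet_criteria(rec: dict, context: str) -> List[str]:
    """List the labels of the criteria the record fails in the given context."""
    return [label for label, check in CONTEXTS[context] if not check(rec)]


for context in CONTEXTS:
    failed = unmet_criteria(record, context)
    verdict = "good" if not failed else "bad"
    print(f"{context}: {verdict} {failed}")
```

<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Under these assumed criteria the same record comes out &quot;bad&quot; for the marketing context (no sentiment) and &quot;good&quot; for the movie context.</p>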
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">And now, for something completely different, it&#39;s...</span></p>
<p dir="ltr" style="font-size:11.0pt;">
<span style="font-family:calibri;">Going back to the data, if it turns out that this data is not part of a tweet but the contents of a file called </span><span style="font-family:courier new;">JOHNSPASSWORDS.txt</span><span style="font-family:calibri;">, then I&#39;m likely applying a totally different context </span><span style="font-family:calibri;">and my understanding of </span><span style="font-family:calibri;">it changes completely.&nbsp; If I&#39;m a security specialist for an organization tracking unencrypted passwords, then this data may meet those criteria and fall into the realm of &quot;bad data&quot;. <img alt="No" height="18" src="https://www.ibm.com/developerworks/community/connections/resources/web/com.ibm.oneui.ckeditor/editor/plugins/sametimeemoticons/images/EmoticonThumbsDown.gif" title="No" width="23" />&nbsp; If I&#39;m a hacker looking for a way into an organization&#39;s systems, then this may in fact be very &quot;good&quot; data!</span>&nbsp; <img alt="Yes" height="18" src="https://www.ibm.com/developerworks/community/connections/resources/web/com.ibm.oneui.ckeditor/editor/plugins/sametimeemoticons/images/EmoticonThumbsUp.gif" title="Yes" width="23" /></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Once you&#39;ve established the context and criteria, and provided some understanding of the data against those, then you can start making statements about value, cost, risk, or compliance -- the measures that indicate the degree to which the data supports or hinders those targets.&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">The Big Bad Data</span><strong>, or the Bad Big Data?</strong></p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Particularly as we move into the realm of Big Data with more data volume, more data variety, higher velocity or influx of data, and more questions about the veracity of the data (or even parts of it), I think the need for establishing the right context, criteria, and understanding becomes imperative.&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
An ongoing series of daily tweets for analyzing immediate social trends may prove to meet my needs or it may not (and may take some time to ascertain the value).&nbsp; But those tweets may have a limited shelf-life.&nbsp; If I&#39;m still storing them a year from now, have I turned them from a value-added asset into bad data that is now just a cost?&nbsp; Probably, since the criterion of immediate trending has passed.&nbsp; Each Big Data case, though, is likely distinct -- working through what is and what may become &quot;bad&quot; data is going to be an ongoing Information Quality challenge.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I&#39;m curious to see what the author discusses in this work and how it fits into the broader contexts of Information Governance and Big Data (Big Bad Data!).&nbsp; Have you read the book?&nbsp; If so, what are your thoughts on &quot;Bad Data&quot; and emerging challenges?&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As always, the postings on this site are my own and don&#39;t necessarily represent IBM&#39;s positions, strategies or opinions.</p>
<p dir="ltr">
<strong>Big Data Quality - Back to the Basics</strong> (posted 2013-06-14)</p>
<p dir="ltr">
This week, I&#39;ve been part of a small team kicking off a new <a href="http://www.redbooks.ibm.com/Redbooks.nsf/pages/about?Open">IBM&nbsp;Redbook</a> initiative around Big Data Governance.&nbsp; We brought some solid and diverse backgrounds together from across the spectrum of information management solutions and products with lively and entertaining discussion on the subject -- I&#39;m very excited to be part of the initiative!</p>
<p dir="ltr">
The intersection of Big Data and Information Governance makes for a broad array of topics addressing data that ranges from social media feeds to sensor inputs to arrays of log files and beyond.&nbsp; Given my long-time work in Information Quality, I have a lot of interest in how we find and establish an effective quality focus for Big Data.&nbsp; From our discussions this week, two key aspects (really fundamental principles of Information Quality) stood out to me, which I would summarize with the phrases:&nbsp; &quot;Know your data&quot; and &quot;Fit for Purpose&quot;.&nbsp;</p>
<p dir="ltr">
<strong>&quot;Know your data&quot;</strong></p>
<p dir="ltr">
For years we&#39;ve preached that the first pillar for information integration is Understanding.&nbsp; Not only does this not change for Big Data, but in many cases you have to dig deeper and cast aside typical assumptions from the world of structured data.&nbsp; Consider a typical traditional operational system such as a Payroll application.&nbsp; The data represents the salaries of your employees and what they have been paid each pay period.&nbsp; You own the data, you control how it is entered and stored (or your application system manual tells you those details), and you either have the metadata or can get it.&nbsp;</p>
<p dir="ltr">
Now consider an external source, perhaps statistics on typical employee salaries by various occupational classes over the last five years.&nbsp; Who created the source?&nbsp; What methodology did they follow in collecting the data?&nbsp; Were only certain occupations or certain classes of individuals included?&nbsp; Did the creators summarize the information?&nbsp; Can you identify how the information is organized and if there is any correlation at any level to information that you have?&nbsp; Has the information been edited or modified by anyone else?&nbsp; Is there any way for you to ascertain this information?&nbsp;</p>
<p dir="ltr">
Aspects such as establishing the provenance (and possible lineage), the methods used in data capture, and the methods (statistical or otherwise) used in data filtering and aggregation, which have long been assumed with traditional data sources, all become core parts of Understanding when addressing Big Data.</p>
<p dir="ltr">
<strong>&quot;Fit for Purpose&quot;</strong></p>
<p dir="ltr">
What&#39;s the quality of a tweet?&nbsp; Or a sensor stream?&nbsp; Or a log file?&nbsp; Or a string of bits that define an image?&nbsp; Does the presence or absence of specific data matter?&nbsp;</p>
<p dir="ltr">
In the world of structured data, we look at a payroll record and say it is complete when the employee ID, payroll date, pay amount, and certain other fields contain values.&nbsp; We say it has integrity when the values in those fields have the right formats and correctly link to data in other tables.&nbsp; We say it has validity when the payroll date is the system date and the pay amount is in an established range.&nbsp; We set these rules when we established what was fit for purpose.</p>
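<p dir="ltr">
As a rough sketch of how those three structured-data rules might be written down, consider the Python below.&nbsp; The field names, the stand-in employee table (<span style="font-family:courier new;">KNOWN_EMPLOYEE_IDS</span>), and the pay range are assumptions chosen for illustration, not taken from any real payroll system.</p>

```python
from datetime import date

# Assumed field names and reference data, for illustration only.
REQUIRED_FIELDS = ("employee_id", "payroll_date", "pay_amount")
KNOWN_EMPLOYEE_IDS = {"E1001", "E1002"}   # stand-in for the employee table
PAY_RANGE = (0.01, 20000.00)              # assumed established range


def is_complete(rec: dict) -> bool:
    """Completeness: every required field contains a value."""
    return all(rec.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def has_integrity(rec: dict) -> bool:
    """Integrity: values have the right types and link to the employee table."""
    return (
        isinstance(rec.get("payroll_date"), date)
        and isinstance(rec.get("pay_amount"), (int, float))
        and rec.get("employee_id") in KNOWN_EMPLOYEE_IDS
    )


def is_valid(rec: dict, today: date) -> bool:
    """Validity: payroll date is the system date and pay is within range."""
    low, high = PAY_RANGE
    return rec["payroll_date"] == today and low <= rec["pay_amount"] <= high


def assess(rec: dict, today: date) -> dict:
    """Run the checks in order; validity is only evaluated on sound records."""
    complete = is_complete(rec)
    integrity = complete and has_integrity(rec)
    valid = integrity and is_valid(rec, today)
    return {"complete": complete, "integrity": integrity, "valid": valid}


run_date = date(2013, 6, 14)
record = {"employee_id": "E1001", "payroll_date": run_date, "pay_amount": 2500.0}
print(assess(record, today=run_date))
```

<p dir="ltr">
Each rule here is a decision someone made about what is fit for purpose; change the purpose and the constants and checks change with it.</p>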
<p dir="ltr">
In the world of Big Data, though, with such a variety and volume of data coming in at high velocity, it&#39;s hard to ascertain what information quality means and many of the traditional information quality measures seem to fall short.&nbsp; Is a tweet complete?&nbsp; Is it correctly formatted?&nbsp; Is it valid?&nbsp; The questions appear nonsensical.&nbsp; So we need to step back and ask &quot;what is fit for our purpose?&quot;&nbsp; And that leads to another question:&nbsp; &quot;what business objective am I trying to address and what value do I expect from that?&quot;&nbsp; If you can answer this second question, you can start building the parameters that establish what is fit for your purpose--i.e. your Business Requirements.</p>
<p dir="ltr">
The intersection of Understanding of the data with your Business Requirements brings you back to the point where you can establish the Information Quality needed for your Big Data initiative.&nbsp; These may not be the traditional structured data measurements.&nbsp; Completeness may indicate that a tweet contains one or more hashtags that you care about -- other tweets should be filtered out.&nbsp; You may need to look at Continuity as a dimension with sensor readings -- did I receive a continuous stream of information and if not, is there a tolerable gap for the data?</p>
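<p dir="ltr">
To make those two measures a little more concrete, here is a minimal Python sketch.&nbsp; The function names (<span style="font-family:courier new;">is_complete_tweet</span>, <span style="font-family:courier new;">find_gaps</span>), the hashtag of interest, and the 30-second tolerance are hypothetical choices; a real initiative would derive them from its own Business Requirements.</p>

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List, Set, Tuple

HASHTAG = re.compile(r"#(\w+)")


def is_complete_tweet(text: str, wanted: Set[str]) -> bool:
    """Completeness: the tweet carries at least one hashtag we care about.

    `wanted` holds lowercase tag names without the leading '#'.
    """
    tags = {tag.lower() for tag in HASHTAG.findall(text)}
    return bool(tags & wanted)


def find_gaps(timestamps: Iterable[datetime],
              tolerance: timedelta) -> List[Tuple[datetime, datetime]]:
    """Continuity: return consecutive reading pairs further apart than tolerance."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > tolerance]


# Filter tweets down to the ones that are "complete" for this purpose.
tweets = ["Loving the #BigData talk", "Lunch was fine"]
kept = [t for t in tweets if is_complete_tweet(t, {"bigdata"})]
print(kept)

# Look for sensor gaps longer than the assumed 30-second tolerance.
start = datetime(2013, 6, 14, 12, 0, 0)
readings = [start + timedelta(seconds=s) for s in (0, 10, 20, 80)]
print(find_gaps(readings, tolerance=timedelta(seconds=30)))
```

<p dir="ltr">
Whether a reported gap is acceptable is still a business decision; the code only surfaces where the stream was not continuous.</p>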
<p dir="ltr">
<strong>Back to the Basics</strong></p>
<p dir="ltr">
These questions are not rocket science.&nbsp; In my mind, these questions are the basics of data analysis (and data science).&nbsp; Information Quality does not go away or disappear with Big Data -- instead Big Data requires us to strip away the assumptions from the structured data world view and ask the questions anew.</p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr">
And as always, the postings on this site are my own and don&#39;t necessarily represent IBM&#39;s positions, strategies or opinions.</p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
<span style="font-weight:bold;">Some background to the journey</span> (posted 2013-06-07)</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
By way of introduction to myself and this blog, I&#39;m Harald Smith and I&#39;m currently a Software Architect with IBM in the Information Management division.&nbsp; I particularly work with many of the InfoSphere brand products and those will be part of my focus in my exploration of the information landscape here.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
As I look back over my career, I can note several points about how it has developed:</p>
<ul dir="ltr">
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">it has not followed a standard career path -- instead it&#39;s been a rather diverse journey often at the boundaries between business and technology</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">it&#39;s been heavily focused on information -- how we use information in applications and for business processes; how we ensure it has the right quality (as well as what that even means); how we protect information and comply with policies; or how we integrate it for new purposes</span></span></li>
<li style="margin-top:0;margin-bottom:0;vertical-align:middle;">
<span style="font-size:11.0pt;"><span style="font-family:calibri;">it&#39;s been focused on helping others understand how to work with information-driven products -- whether documenting best practices, methods, and approaches; managing the design and delivery of products for specific needs; or just responding to questions and issues</span></span></li>
</ul>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
These are themes that I hope to bring out and explore in this blog.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Outside of my career itself, I like to hike and travel (more journeys!); I like history, art, and science in general (more exploration of diverse information); and I enjoy playing and designing games (though generally not video games).&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I often find it surprising how these aspects inform my work.&nbsp; My interest in history has continued over the years through work on genealogy/family history.&nbsp; As you work back 4, 5, 6 generations, you quickly get to thousands of individuals with sporadic pieces of data, often of dubious quality.&nbsp; I run into the same questions there that I do with business information (particularly with a common surname of Smith).&nbsp; What constitutes good quality information?&nbsp; Is the context of the information right?&nbsp; What other sources of information (e.g. DNA) can help me connect and enrich what I already know?&nbsp; I have to determine which pieces of information I trust, which I&#39;ll integrate and which I&#39;ll exclude.&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
I hope to draw on these experiences and examples in exploring the information landscape.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
Whether looking at and working with information in a work or personal context, I also tend to focus on common patterns.&nbsp; Such patterns allow us to develop approaches, techniques, or best practices to work with information.&nbsp; This is the core theme of the recent book, <a href="http://www.informit.com/store/patterns-of-information-management-9780133155501?WT.mc_id=Author_Chessell_PoIM">Patterns&nbsp;of&nbsp;Information&nbsp;Management</a>, co-authored by Mandy Chessell and myself and published by Pearson/IBM Press.&nbsp; As we have a separate <a href="https://www.ibm.com/developerworks/mydeveloperworks/groups/service/html/communityview?communityUuid=8b999d32-11d5-4f68-a06e-6825f3c78233">IBM developerWorks&nbsp;community</a> focused on the topics of the book, I will generally focus on those specific pattern topics in detail there, not here.&nbsp;&nbsp;</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
That said, there are plenty of discussion points across a broad range of information management topics which I hope to address here in subsequent blogs.&nbsp; If there are specific topics of interest to you regarding information management, please let me know and I will see if I can address them.</p>
<p dir="ltr" style="font-family:Calibri;font-size:11.0pt;">
And as always, the postings on this site are my own and don&#39;t necessarily represent IBM&#39;s positions, strategies or opinions.</p>