It’s often been said in the commercial real estate industry and brick-and-mortar businesses alike: “It’s about location, location, location.” That statement has never been truer than in the technological world we live in today.

Get the GIS

If the precision of location is vital and “Data is the blood running through the veins of the Networked Society”, then GIS should be considered the occipital lobe, since it is a large determinant of how we see our world. Ever since the advent of GIS (Geographic Information Systems) and spatial analysis over five decades ago, the science (some might say art) of geocoding has significantly shaped our world and how it functions today. For all of the impressive capabilities and benefits that come from having a digitally encoded representation of much of the structures and topography that comprise our world, geocoding is not without its shortcomings. These stem partly from technical debt, constraints inherent in legacy systems, and processes and approaches developed over several decades around questionable “best” practices and obsolete technology.

Geocoding has been a godsend in the areas of cartography and topography, providing benefits to Government and Industry alike. More recently, consumers have benefited by way of automotive navigation systems and the countless conveniences afforded by the proliferation of the smartphone throughout modern and developing societies.

For the uninitiated, geocoding is the process of converting street address details derived from textual information into viable geospatial data. This data typically consists of longitude, latitude, and accompanying enrichment text captured and stored in a computer-friendly format. It commonly resides in a relational database management system, where it may be further enhanced with additional attributes through data mappings and/or user input.
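To make that concrete, here is a minimal sketch of what such a geocoded record might look like. The field names, the match-quality value, and the coordinates are illustrative assumptions, not drawn from any particular system; the coordinates are only approximate.

```python
# A hypothetical geocoded record: the input address text plus the
# latitude/longitude and enrichment attributes a geocoder might attach.
# Field names and values are illustrative examples only.
record = {
    "input_address": "1840 Century Park East, Los Angeles, CA",
    "latitude": 34.0580,              # decimal degrees (approximate)
    "longitude": -118.4160,           # decimal degrees (approximate)
    "match_quality": "interpolated",  # e.g. rooftop vs. street-segment match
    "county_fips": "06037",           # enrichment attribute (Los Angeles County)
}

print(record["latitude"], record["longitude"])
```

In a relational store, each of these keys would map to a column, ready for the further enrichment described above.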

In a perfect world

Alas, if only it were so simple that all street address data could be robust, resilient, and reliable. In the real world, it is rare that any of these conditions are true for long. To make matters worse, computer systems fed malformed data produce erroneous results — a notion captured in the phrase “Garbage In, Garbage Out” and abbreviated GIGO. GIS systems are particularly subject to the law of GIGO, and street address data is considered to be unusually dirty.

Leading causes of dirty street addresses:

Non-standard abbreviations

Attribute misorderings

Data entry mistakes
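Each of the three causes above can be illustrated with a small, made-up sample; the addresses below are hypothetical examples, not real records.

```python
# Hypothetical examples of the three leading causes of dirty street addresses.
dirty_addresses = {
    "non_standard_abbreviation": "1840 Century Pk E, Los Angeles, CA",      # "Pk" for "Park"
    "attribute_misordering":     "E Century Park 1840, Los Angeles, CA",    # number and direction swapped
    "data_entry_mistake":        "1840 Centruy Park East, Los Angles, CA",  # typos in name and city
}

for cause, example in dirty_addresses.items():
    print(f"{cause}: {example}")
```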

Given the various methods of data acquisition and capture, it’s no surprise that address data can wreak havoc upon OSS (Operational Support Systems) and BSS (Business Support Systems) efforts and processes at the most inconvenient times. The result is costly delays across deliverables, ranging from projects and initiatives as essential as market analysis, real estate planning, accounting and taxation, finance management, shipping, customer relations, sales, and strategic planning to, goodness forbid the most critical of all, billing and overall user experience (be afraid, be very afraid).

An issue that needs to be Addressed

Imagine — your day-to-day work is happening and some new strategic corporate directives are beginning to unfold for the new fiscal year. One initiative involves the Finance, Sales, and Marketing groups. Someone in Marketing wants to use the existing customer base to determine whether business opportunities can be pursued in close proximity to existing company assets. These assets are tracked and managed in the Finance systems, and the hope is to identify low-hanging fruit that can aid the Sales team in hitting Q2 target goals. Marketing acquires customer lead data by way of an outside firm, plus internal data via historical quotes and existing contracts. A market analyst acquires data extracts from the Finance system to do detailed analysis, only to find that the datasets cannot easily be mapped and correlated across the business domains due to excessive street address mismatches. After several unsuccessful attempts to merge the data, an escalation occurs and senior management requests that IT engage to help the business resolve the issues surrounding the datasets so that the analyst can proceed with the market analysis.

The Data Warehouse team is then tasked by the CIO to bring the unwieldy street addresses under control. The Developers and DBAs naturally assume — given the view of the problem space from a data-centric standpoint — that normalization of the street address data is the ideal approach.

An old Normal

Normalization may not be the silver bullet many assume; instead, it can be the defective ammo that backfires on you. Even when dealing with internal datasets, street address standards can vary by business domain as well as by the individual tools themselves.

Failing to take into consideration the multitude of street address formats when deriving data from an array of non-standard sources can lead to a world of pain for both IT and the business. Both should be seriously prepared to cycle multiple times as they attempt to resolve what can become a major pain point if they are unprepared for that type of challenge.

A few of the initial obstacles of normalization are detailed below.

Approach: Identify the component parts of an address.
Challenge: Multiple address sources tend to vary based on primary business function.

Approach: Transform into a standardized format.
Challenge: Each component of the address must be successfully mapped to its address attributes.

Approach: Implement fuzzy matching.
Challenge: The normalization logic must adequately identify the most likely address attributes to associate with each component of the input address.
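The fuzzy-matching approach can be sketched with nothing more than Python’s standard library; `difflib.SequenceMatcher` scores string similarity between 0.0 and 1.0. The 0.8 threshold and the suffix list here are arbitrary, illustrative choices, not production values.

```python
# A minimal fuzzy-matching sketch using the standard library.
from difflib import SequenceMatcher

def best_match(component: str, candidates: list[str], threshold: float = 0.8):
    """Return the candidate attribute value most similar to the input
    component, or None if nothing clears the threshold."""
    scored = [(SequenceMatcher(None, component.upper(), c.upper()).ratio(), c)
              for c in candidates]
    score, winner = max(scored)
    return winner if score >= threshold else None

street_suffixes = ["STREET", "AVENUE", "BOULEVARD", "DRIVE"]
print(best_match("Boulvard", street_suffixes))  # typo still matches "BOULEVARD"
print(best_match("Plaza", street_suffixes))     # no close match, so None
```

A real engine would layer rules on top of raw similarity scores, which is exactly the challenge noted above.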

If any of these challenges is not adequately addressed in the design, successfully tested, and implemented, the result can be a multitude of issues, turning low-quality and duplicate street address records into multiple street address records when merging datasets. Rectifying the situation then requires highly manual, labor-intensive research, while potentially eroding end-user confidence and experience. The following street addresses all refer to the same location, but depending upon the context each may or may not be a usable street address for a particular business case. In addition, depending upon DB id and key constraints, they may or may not trigger a key violation. Technical challenges aside, gaining interdepartmental business consensus to resolve duplicate-record concerns can be a project in itself.

Address: 1840 Century Park East, # 1200, Los Angeles, CA 90007-21000
Business case: Sales

Address: 1840 Century Park E, 1200, Los Angeles, CA, 90007-21000
Business case: Finance

Address: 1840 CENTURY PARK E, UNIT 1200, LA, CA
Business case: Marketing

Address: E Century Park 1840, Los Angeles, CA, 90007
Business case: Planning
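A naive canonicalization pass shows both the promise and the limits here. The sketch below (with a made-up substitution table; a real one would follow USPS Publication 28) collapses two of the variants above to the same key, but the reordered Planning-style variant defeats it.

```python
import re

# Illustrative substitution table only; not a real Publication 28 mapping.
SUBSTITUTIONS = {"EAST": "E", "UNIT": "#", "LA": "LOS ANGELES"}

def canonical_key(address: str) -> str:
    """Very naive canonicalization: uppercase, strip punctuation, apply
    word-level substitutions. Word order is NOT handled, which is exactly
    why the reordered variant below fails to match."""
    words = re.sub(r"[.,#]", " ", address.upper()).split()
    return " ".join(SUBSTITUTIONS.get(w, w) for w in words)

a = canonical_key("1840 Century Park East, # 1200, Los Angeles, CA")
b = canonical_key("1840 Century Park E, 1200, Los Angeles, CA")
c = canonical_key("E Century Park 1840, Los Angeles, CA")
print(a == b)  # the two well-ordered variants now collide correctly
print(a == c)  # the misordered variant still escapes
```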

Given the aforementioned challenges, all hope is not lost, although if normalization is not appropriately implemented, the accuracy and precision of the data are likely to be in jeopardy. Next, let’s take a deeper dive into the methodologies and techniques often used.

Substitution Based Normalization

This is a less complex method that uses lookup tables to identify frequently encountered terms by their string values and substitute standardized forms.

That simplicity, however, constrains its applicability: its capabilities are limited to correcting abbreviations and stripping immaterial data.

The drawbacks of the underlying “tokenization” technique should not be overlooked either, as its shortcomings can wreak havoc on the end result, namely where the street address contains keywords that can also be assigned as an attribute.

Ex. “123 Street Drive East”, where neither word is in its expected position: while “Street” is a post thoroughfare type, it is also a valid street name.
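The pitfall in that example is easy to reproduce. In this sketch, a naive substitution pass blindly rewrites every known thoroughfare keyword; the lookup values are illustrative USPS-style abbreviations, not a complete table.

```python
# A sketch of the tokenization pitfall in substitution-based normalization.
ABBREVIATIONS = {"STREET": "ST", "DRIVE": "DR", "EAST": "E"}

def naive_substitute(address: str) -> str:
    """Blindly replace every token found in the lookup table."""
    return " ".join(ABBREVIATIONS.get(w, w) for w in address.upper().split())

# Here "Street" is the street *name*, but the naive pass abbreviates it
# anyway, mangling the address.
print(naive_substitute("123 Street Drive East"))  # -> "123 ST DR E"
```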

Context-Based Normalization

This is a less commonly used methodology for addresses, as it is considered more complex and difficult to implement.
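A minimal flavor of the context-based idea: instead of blind substitution, classify each token by its position before rewriting it. The positional rules and abbreviation tables below are a toy assumption, nothing like a full Publication 28 grammar, but they show how context lets a street *named* “Street” survive.

```python
# A toy context-based normalization sketch: only the token in the suffix
# position is abbreviated; earlier keyword occurrences are treated as name.
SUFFIXES = {"STREET": "ST", "DRIVE": "DR", "AVENUE": "AVE"}
DIRECTIONS = {"EAST": "E", "WEST": "W", "NORTH": "N", "SOUTH": "S"}

def context_normalize(address: str) -> str:
    tokens = address.upper().split()
    direction = DIRECTIONS.get(tokens[-1])      # trailing post-directional?
    body = tokens[:-1] if direction else tokens
    out = list(body)
    # Treat the *last* suffix keyword as the suffix; any earlier occurrence
    # (e.g. a street named "Street") stays part of the name.
    for i in range(len(body) - 1, -1, -1):
        if body[i] in SUFFIXES:
            out[i] = SUFFIXES[body[i]]
            break
    if direction:
        out.append(direction)
    return " ".join(out)

print(context_normalize("123 Street Drive East"))  # -> "123 STREET DR E"
```

Even this toy version needs ordering rules, which hints at why the context-based route is harder to build and maintain.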

At this point, you may be thinking, “Couldn’t you take a more simplified approach, assuming it suits the business case at hand, along the lines of creating lookup tables and joins?” Perhaps, but that could just as easily become risky business once fully deployed into production, resulting in degraded overall system performance. Less-than-ideal table utilization and DB design can degrade performance significantly, especially for data structures subject to continuous heavy reads and writes. That places the onus upon operations to mitigate the issue in production by staying vigilant about DB monitoring, management, and indexing of the servers and applications affected by the implementation.
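For concreteness, here is what that lookup-table-and-join approach might look like in miniature, using an in-memory SQLite database. The table and column names are hypothetical; at production scale, joins like this against hot, heavily written tables are precisely what demands the careful indexing and monitoring just described.

```python
# A sketch of the "lookup tables and joins" approach; names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE suffix_lookup (raw TEXT PRIMARY KEY, standard TEXT);
    INSERT INTO suffix_lookup VALUES ('STREET','ST'), ('AVENUE','AVE'), ('DRIVE','DR');
    CREATE TABLE addresses (id INTEGER PRIMARY KEY, number TEXT, name TEXT, suffix TEXT);
    INSERT INTO addresses VALUES (1, '1840', 'CENTURY PARK', 'DRIVE');
""")

# Standardize the suffix via a join against the lookup table, falling back
# to the raw value when no mapping exists.
row = con.execute("""
    SELECT a.number, a.name, COALESCE(s.standard, a.suffix)
    FROM addresses a LEFT JOIN suffix_lookup s ON a.suffix = s.raw
""").fetchone()
print(" ".join(row))  # -> "1840 CENTURY PARK DR"
```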

So, you think you want to Geocode

Geocoding done properly is not for the faint of heart, particularly if the data is relied upon in mission-critical, revenue-impacting, or moderate-to-high-risk operational use cases. As if address management didn’t already have challenges of its own, and assuming your organization has established an address management process, at no small cost in money and time, while ascending the ladder to world-class Address Management, let’s discuss geocoding.

For the sake of grounding this conversation and to avoid going on ad nauseam, I’ll be succinct and focus solely on the geocoding process as it relates to North America, or more specifically the TIGER (Topologically Integrated Geographic Encoding and Referencing) dataset format provided by the US Census Bureau. In a subsequent discussion, I’ll speak on other global regions as they relate to the topic of GIS in the Address Management space.

Breaking up is not easy

One of the initial steps in geocoding is normalization, otherwise known as parsing. The primary objective of parsing is to break up the unformatted input address string into a specifically defined standard format; in the United States, that standard is USPS Publication 28.
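A crude parsing pass can be sketched with a single regular expression that splits an input line into the kinds of component fields Publication 28 describes (primary number, street name, suffix, unit). The pattern below is an illustrative assumption, nowhere near production-grade, and its suffix list is deliberately tiny.

```python
import re

# Illustrative parse of a free-form address line into component fields.
ADDRESS_RE = re.compile(
    r"^(?P<number>\d+)\s+(?P<name>.+?)\s+"
    r"(?P<suffix>ST|AVE|DR|BLVD|STREET|AVENUE|DRIVE|BOULEVARD)"
    r"(?:\s+(?:#|UNIT|APT)\s*(?P<unit>\w+))?$",
    re.IGNORECASE,
)

m = ADDRESS_RE.match("1840 Century Park Blvd Unit 1200")
print(m.groupdict())
# -> {'number': '1840', 'name': 'Century Park', 'suffix': 'Blvd', 'unit': '1200'}
```

Real parsers handle pre/post-directionals, PO boxes, rural routes, and far messier input, which is where the pain begins.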

To clarify: if you’re going to be “breaking up,” you’re going to need input, as in address data. Address data, in general, is fairly abundant. However, the majority of it is of low quality and rarely as complete as it could be, something you have probably known, assumed, or experienced previously. So, if you embark on geocoding, please be part of the solution and not the problem: don’t propagate low-grade address data and allow it to proliferate throughout the corporate infrastructure and the data-connected world. To restate a key point, the higher the quality of the reference data you start with, the better your odds of success (remember GIGO).

A sense of normalcy

If you accept your mission and decide to stay the course with the plan to geocode, I salute you (it seems only right when you’re going into battle). Here you’ll begin a two-step process of normalization and standardization, where you get the chance to take those “dirty” addresses, clean them up, and hopefully make them into something respectable when all is said and done; however, there’s still a ways to go before you get there. In short, the goal is to map address text to address attributes, again not nearly as simple as it may sound. Recall the previously mentioned normalization techniques: done correctly, they can greatly assist in showing your data who’s boss, operative phrase “done correctly”.

Assuming your normalization yielded good results, you’re ready for standardization. Put simply, standardization converts the normalized data into the correct format expected by the subsequent components of an address processing system, such as a geocoder. In layman’s terms, standardization performs a format conversion that leaves the data geocoder-ready. At this point, you’re not quite out of the woods just yet; you’re going to need to be equipped to climb a mountain, and then some.
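As a sketch of that conversion step: take already-normalized components and emit them in the fixed layout a downstream geocoder expects. The component keys and the output layout here are hypothetical, not any specific geocoder’s input format.

```python
# Illustrative standardization: normalized components in, fixed-layout
# geocoder-ready string out. Keys and layout are assumptions.
def standardize(components: dict) -> str:
    order = ["number", "predir", "name", "suffix", "postdir", "unit"]
    return " ".join(components[k].upper() for k in order if components.get(k))

normalized = {"number": "1840", "name": "Century Park", "postdir": "E", "unit": "1200"}
print(standardize(normalized))  # -> "1840 CENTURY PARK E 1200"
```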

Code of the streets

Now, if you’re ready to do some geocoding, you have 2 options. Your first option is to roll your own geocoder. Your second option is to acquire a COTS Geocoder solution.

Option 1 – Roll your own

Creating your own geocoder is an effort that will largely involve you and your team writing “mapping functions” to translate normalized-form data into target output. Ideally, the transformations are encoded within the mapping functions used for feature matching (points, lines, and polygons). In addition, you’ll need to write code for a rules-based matching engine to identify the best matches against the reference datasets, specifically the point, linear, and polygon datasets. Each of these dataset files contains its own data structures, which must be interpolated by the geocoder as well. As you probably gleaned from that brief summation, there is indeed a steep learning curve to endure.
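The interpolation over linear datasets can at least be sketched simply. A TIGER-style street segment carries an address range plus endpoint coordinates, and the geocoder places a house number proportionally along the line. The segment values below are made up for illustration, and real segments distinguish left/right side ranges, which this sketch ignores.

```python
# Linear interpolation over a hypothetical TIGER-style street segment.
def interpolate(house: int, low: int, high: int,
                start: tuple[float, float], end: tuple[float, float]):
    """Place a house number proportionally between the segment endpoints
    (given as (lat, lon) pairs) based on its position in the address range."""
    frac = (house - low) / (high - low)
    return (start[0] + frac * (end[0] - start[0]),
            start[1] + frac * (end[1] - start[1]))

# Hypothetical segment covering house numbers 100-198.
lat, lon = interpolate(150, 100, 198, (34.0500, -118.4000), (34.0510, -118.4020))
print(round(lat, 5), round(lon, 5))
```

Multiply this by rules for side-of-street, ties between candidate segments, and malformed ranges, and the learning curve mentioned above comes into focus.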

Option 2 – Purchase COTS geocoder

Your second option is to acquire a COTS geocoder solution, which comes with its own challenges: software licensing fees, vendor support fees, hardware costs, and vendor-managed software updates, along with patching and OS requirements and constraints. Oddly enough, Address Management is typically treated organizationally as an operating expense when, all things considered, it should probably be viewed as a capital expenditure.

Layers upon layers

Assuming you’ve taken the steps to plan, design, develop, test, and deploy a geocoder, my hat’s off to you. Given that this domain expertise is not abundant in IT organizations outside GIS software shops, you might end up wearing multiple hats: database management, application management, platform development, DevOps, production support, and all-around GIS subject matter expert. Don’t forget, your original goal and objective was only Address Management. All you needed originally were quality addresses so you could move forward and focus company resources on other business challenges within your organization. What just happened?

…Although I believe there may be a third option, one that may be the answer to the Address Management and Location pain that persists to this day.