Identification and Record Linkage

Identification and Record Linkage Application Note

This page documents the identification and record linkage application note. It is part of AgGateway's ADAPT Minimum Viable Product (MVP).
The purpose of this application note is to document best practices for what to do when an ADAPT file contains an identifier that is not recognized by the system receiving it.

Assumptions

Two or more agricultural businesses share some of the same customers and wish to exchange certain data in support of their mutual customers.

Introduction

Contemporary farming requires the continuous exchange of information between growers and partners such as agronomists, retailers, custom applicators, insurance agents and customers. A critical part of this interoperability is identification, where a name or code (the “identifier”) is used to reference a particular instance of a data object. This allows Farm Management Information Systems (FMIS) to distinguish that unique instance from others, and to recognize that instance when running into it again. It also enables Machine and Implement Control Systems (MICS) to keep track of what products to apply, and where to apply them.

There are many motivations for uniquely (and unambiguously) identifying resources in production agriculture data exchange. Examples include specifying the products being applied (or planned for application) in a particular field operation; specifying the location(s) where these products are applied; and enabling an audit trail (the aspiration of farm-to-fork traceability) for field operations processes.

Centralized approaches to identification are often used in agriculture: supply-chain operations increasingly use GTINs (Global Trade Item Numbers), GLNs (Global Location Numbers), and EANs (International Article Numbers), all codes minted by one or more numbering authorities. This makes clear where the identifier originated (its “source”) and what its meaning is. However, this approach doesn't currently work in field operations. A grower might use thousands of identifiers to name their farms, fields and documents. Those identifiers may be needed in situations with no Internet connectivity, and paying for identifiers is counter-intuitive to many end-users. The distributed minting of identifiers is the reality in field operations, as the various FMIS and MICS solutions in the marketplace typically create their own identifiers. An additional complication is that there is little format standardization among these identifiers: different FMIS and MICS manufacturers use a variety of identifier data types, such as integers, GUIDs (Globally Unique Identifiers), URIs (Uniform Resource Identifiers), and proprietary string hashes.

Workflows involving identification often break down when a grower (or other actor) imports data into their FMIS from an external MICS or FMIS. Incoming data (which may correspond to objects such as farms, fields, and products in the grower’s own FMIS) might use externally-minted identifiers that the grower’s FMIS does not recognize, possibly in a format that does not match the one used by the FMIS. The user must then manually match these unknown identifiers with known objects in their system. (The process can be supported by spatial overlap checking, string comparison metrics, and so forth.) Users dislike this data mapping (also called record linkage or object identification) because it is time-consuming and error-prone, and it is ultimately an obstacle to the broader adoption of precision ag technologies. This is especially true as users increasingly expect frictionless data entry.

This note mainly advocates avoiding record linkage altogether by sharing identifiers across the ecosystem. It also offers some guidance on resolving record linkage problems when they are unavoidable.

How ADAPT Implements Identification

Figure 1: The CompoundIdentifier

All shareable data objects in ADAPT are identified with an Id property, a CompoundIdentifier (Figure 1). The CompoundIdentifier contains an integer ReferenceId and an unlimited number of UniqueIds. The ReferenceId is used to reference the CompoundIdentifier from other objects, but is unique only within the current ApplicationDataModel instance. Following the ISO 11783 convention, ReferenceIds that originate in an FMIS are positive, and those from a MICS are negative.
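The structure described above can be pictured with a small sketch. This is a simplified Python stand-in, not the ADAPT classes themselves: the attribute names, the helper minted_by_fmis, and the example identifier values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UniqueId:
    """One externally meaningful identifier (simplified, hypothetical)."""
    id: str                        # identifier value, in whatever format its source uses
    source: Optional[str] = None   # who minted the identifier, if known


@dataclass
class CompoundIdentifier:
    """Simplified stand-in for the CompoundIdentifier described in this note."""
    reference_id: int              # unique only within one ApplicationDataModel instance
    unique_ids: List[UniqueId] = field(default_factory=list)


def minted_by_fmis(identifier: CompoundIdentifier) -> bool:
    """Apply the sign convention: positive ReferenceIds originate in an FMIS."""
    return identifier.reference_id > 0


# Made-up example: one object known to two systems under different identifiers.
north_40_id = CompoundIdentifier(
    reference_id=12,
    unique_ids=[
        UniqueId("3f2504e0-4f89-11d3-9a0c-0305e82c3301", "https://fmis.example.com"),
        UniqueId("4711", "https://retailer.example.org"),
    ],
)
```

Here the integer 12 only has meaning inside the one data model instance, while the two UniqueIds let either system recognize the object later.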

Avoiding Record Linkage

The best way to deal with record linkage is to avoid it altogether. The solution proposed by the ADAPT team has two components: one technical, the other social.

The technical component:

Use the CompoundIdentifier to attach UniqueIds (irrespective of their format) to ADAPT objects you will be sharing.

Whenever possible, include the source of the identifier (e.g., who minted the identifier to begin with) along with the UniqueId.

The social component:

Make it easier for data exchange partners in the industry to recognize incoming identifiers.

Systems are encouraged to append their own UniqueId and pass it along with the data to trading partners. Receiving systems should persist these identifiers and re-export them, so that the data carries its own cross-reference and effectively describes itself.
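As a minimal sketch of this practice, the importing system below keeps every UniqueId it receives and appends one of its own if none is present yet. The source URIs, the MY_SOURCE constant, and the ensure_own_id helper are hypothetical names chosen for the example, not part of ADAPT.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional

MY_SOURCE = "https://fmis.example.com"  # hypothetical source of the importing system


@dataclass(frozen=True)
class UniqueId:
    """Simplified stand-in for a UniqueId: a value plus who minted it."""
    id: str
    source: Optional[str] = None


def ensure_own_id(unique_ids: List[UniqueId], my_source: str = MY_SOURCE) -> List[UniqueId]:
    """Return a copy that keeps all incoming ids and has at least one from my_source."""
    if any(u.source == my_source for u in unique_ids):
        return list(unique_ids)
    return list(unique_ids) + [UniqueId(str(uuid.uuid4()), my_source)]


# On import: the partner's identifier is preserved and ours is added.
incoming = [UniqueId("field-0042", "https://retailer.example.org")]
stored = ensure_own_id(incoming)

# On re-export: nothing is dropped and no second id of ours is minted.
re_exported = ensure_own_id(stored)
```

Because the partner's identifier travels back with the data, the partner can recognize the object on the return trip without any record linkage.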

Solving Record Linkage

Even when there is perfect sharing of IDs across the industry, when an identifier first comes in "from the wild" (for example, because it was created in the cab of a machine) the first FMIS that consumes it has to solve the record linkage problem.

This is a topic where different companies can compete with sophisticated proprietary solutions to make the user experience as frictionless as possible. Fortunately, there is an abundance of scientific literature on the topic, primarily from the field of health care. Here are some tips:

Classic solutions to the record linkage problem rely on string comparison metrics. In agricultural field operations this approach works well when matching the names of products, people, and equipment, for example.
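A minimal sketch of name matching with a string comparison metric follows. It uses the similarity ratio from Python's standard difflib module as one possible metric; the 0.8 threshold, the normalization step, and the field names are assumptions for illustration, and a production system would likely tune or replace all three.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def normalize(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 based on matching character blocks."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(incoming: str, known: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return the most similar known name, or None if nothing reaches the threshold."""
    if not known:
        return None
    score, name = max((similarity(incoming, k), k) for k in known)
    return name if score >= threshold else None


known_fields = ["North Field 40", "South Field", "Home Quarter"]
suggestion = best_match("Noth Field 40", known_fields)   # typo in the incoming name
no_suggestion = best_match("Riverside", known_fields)    # nothing similar enough
```

A result like this is best treated as a suggestion for the user to confirm rather than as an automatic link.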

Note that many MICS have limitations on the number of characters they can store and visualize for any given object's description. This can be a serious problem when dealing with long field/cropzone and product names, because it can lead to documenting the application of the wrong product and/or a wrong location. Since this can have serious regulatory implications for the farmer, we encourage systems manufacturers to be mindful of target systems' name-length limitations, and to export suitably-modified object descriptions when sending setup data to a MICS.
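One way to prepare names for a length-limited target system is sketched below. It assumes that the full names are distinct, that the limit is at least four characters, and that a "~2"-style suffix is acceptable on the display; the shorten_names helper and the suffix style are illustrative choices, not an ADAPT or MICS convention.

```python
from typing import Dict, List


def shorten_names(names: List[str], max_len: int) -> Dict[str, str]:
    """Map each full name to a distinct display name of at most max_len characters."""
    if max_len < 4:
        raise ValueError("max_len must leave room for a short numeric suffix")
    used = set()
    result = {}
    for name in names:
        candidate = name[:max_len]
        counter = 1
        # On a collision, replace the tail with "~2", "~3", ... until the name is unique.
        # Assumes collisions stay few enough that the suffix fits within max_len.
        while candidate in used:
            counter += 1
            suffix = f"~{counter}"
            candidate = name[: max(0, max_len - len(suffix))] + suffix
        used.add(candidate)
        result[name] = candidate
    return result


full_names = ["North Field 40 East Half", "North Field 40 West Half", "Home"]
display_names = shorten_names(full_names, max_len=12)
```

Plain truncation would give the first two names the same display text, which is exactly the situation that can lead an operator to pick the wrong field.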

Agricultural field operations data often include spatial references. Leveraging the spatial footprint of field operations data can enable very effective solutions to matching up an incoming field/cropzone and logged data with the corresponding field/cropzone in an FMIS.
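The idea of matching on spatial footprint can be sketched with bounding boxes. This is a deliberate simplification: it assumes coordinates in a projected system (for example meters), compares rectangles rather than true field boundaries, and uses an arbitrary 0.5 threshold. A real implementation would more likely intersect the boundary polygons with a geometry library.

```python
from typing import Dict, NamedTuple, Optional


class BBox(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def area(box: BBox) -> float:
    """Area of the box; an empty or inverted box has area 0."""
    return max(0.0, box.max_x - box.min_x) * max(0.0, box.max_y - box.min_y)


def overlap_ratio(a: BBox, b: BBox) -> float:
    """Intersection area divided by union area, between 0.0 and 1.0."""
    inter = BBox(max(a.min_x, b.min_x), max(a.min_y, b.min_y),
                 min(a.max_x, b.max_x), min(a.max_y, b.max_y))
    inter_area = area(inter)
    union_area = area(a) + area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


def match_by_footprint(incoming: BBox, known: Dict[str, BBox],
                       threshold: float = 0.5) -> Optional[str]:
    """Return the known field whose footprint overlaps the incoming data the most."""
    best_name, best_score = None, 0.0
    for name, box in known.items():
        score = overlap_ratio(incoming, box)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


known_fields = {
    "North 40": BBox(0, 0, 400, 400),
    "South 80": BBox(0, -800, 400, 0),
}
logged_data_extent = BBox(10, 5, 395, 390)   # extent of incoming logged data
matched = match_by_footprint(logged_data_extent, known_fields)
```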

Remember that you also need to match things up in time: most FMIS need to match incoming harvest or as-applied data with a given crop year and/or crop-season-specific Cropzone. You will likely need to leverage the timestamps on the incoming field operations data, as well as any available crop information, to successfully solve the record linkage problem in this case.
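Matching in time can be sketched as a filter over season-specific crop zones. The SeasonCropZone record, its date-range fields, and the example seasons are assumptions for illustration; an FMIS will have its own way of modeling crop years and seasons.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional


@dataclass
class SeasonCropZone:
    """Hypothetical season-specific crop zone record in the receiving FMIS."""
    name: str
    field_name: str
    crop: str
    start: date
    end: date


def candidate_zones(zones: List[SeasonCropZone], field_name: str,
                    logged_at: datetime, crop: Optional[str] = None) -> List[SeasonCropZone]:
    """Keep the zones on the given field whose season contains the timestamp."""
    day = logged_at.date()
    hits = [z for z in zones if z.field_name == field_name and z.start <= day <= z.end]
    if crop is not None:
        # Crop information, when available, narrows the candidates further.
        hits = [z for z in hits if z.crop.lower() == crop.lower()]
    return hits


zones = [
    SeasonCropZone("North 40 corn 2023", "North 40", "Corn", date(2023, 4, 1), date(2023, 11, 30)),
    SeasonCropZone("North 40 soybeans 2024", "North 40", "Soybeans", date(2024, 4, 1), date(2024, 11, 30)),
]
harvest_hits = candidate_zones(zones, "North 40", datetime(2024, 10, 12, 14, 30), crop="soybeans")
```

If more than one candidate remains, or none, the decision is better left to the user than guessed.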

More Sample code

Keys to a Successful Implementation

Understanding the CompoundIdentifier class is critical to understanding how the elements in the ADAPT ApplicationDataModel relate to each other and to the various applications contributing to their values.

The ReferenceId property lets many associated elements refer to a single element instead of duplicating its properties. The Field class is a good example.

The name, size, and field boundary of a given field aren't stored within a CropZone, but rather are referenced from the CropZone by its FieldId.

The value of the integer FieldId property within a CropZone instance must match the Id.ReferenceId property within the corresponding Field class instance.

It is also important to note that the specific value of Field.Id.ReferenceId (and therefore of CropZone.FieldId) applies only within the context of the current ApplicationDataModel instance. Another instance of the data model may refer to the exact same Field using a different value for Id.ReferenceId.

Within each CompoundIdentifier, however, is a collection of UniqueId objects, each of which holds an immutable identifier assigned by a specific application, along with a Source property identifying that contributor. This UniqueId value allows an application to relate a given data element within an ApplicationDataModel to the corresponding element within its own data store.
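The relationships described in this section can be sketched as follows. These are simplified Python stand-ins rather than the ADAPT classes; the helper names field_for and local_key, the source URI, and the example values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MY_SOURCE = "https://fmis.example.com"  # hypothetical source of our application


@dataclass
class UniqueId:
    id: str
    source: str


@dataclass
class CompoundIdentifier:
    reference_id: int
    unique_ids: List[UniqueId] = field(default_factory=list)


@dataclass
class Field:
    id: CompoundIdentifier
    description: str


@dataclass
class CropZone:
    id: CompoundIdentifier
    field_id: int          # must equal the owning Field's id.reference_id
    description: str


def field_for(crop_zone: CropZone, fields: List[Field]) -> Optional[Field]:
    """Resolve a crop zone's field within the same data model instance."""
    return next((f for f in fields if f.id.reference_id == crop_zone.field_id), None)


def local_key(identifier: CompoundIdentifier, my_source: str) -> Optional[str]:
    """Find the UniqueId our own application attached, if any."""
    return next((u.id for u in identifier.unique_ids if u.source == my_source), None)


shared = UniqueId("field-0042", MY_SOURCE)

# The same real-world field appears in two data model instances
# under different ReferenceIds but with the same UniqueId.
model_a_fields = [Field(CompoundIdentifier(1, [shared]), "North 40")]
model_a_zone = CropZone(CompoundIdentifier(2), field_id=1, description="North 40 corn")
model_b_fields = [Field(CompoundIdentifier(17, [shared]), "North 40")]
model_b_zone = CropZone(CompoundIdentifier(18), field_id=17, description="North 40 corn")

field_a = field_for(model_a_zone, model_a_fields)
field_b = field_for(model_b_zone, model_b_fields)
```

The ReferenceIds 1 and 17 differ between the two instances, yet local_key returns the same value for both fields, which is what lets the application tie them to one record in its own store.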

Open Issues

The ADAPT ApplicationDataModel does not currently support change tracking. We expect to add it to the model in the future.

In the current version, applications wishing to merge elements from one or more ApplicationDataModels must use a more "brute force" method to recognize changes. New elements are perhaps the easiest to identify. Since each application is encouraged to add its own UniqueId to the compound identifier of an element, any element that does not contain a UniqueId for a given application can be considered new and therefore inserted. The only way to determine whether an element has been modified by another application is to compare its individual properties. Deletions are by far the hardest to detect. Since elements are only required to be in a model if they are referenced by another element, the absence of an element does not necessarily indicate that it has been deleted.
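The brute-force approach can be sketched as a classification pass over incoming elements. The element layout (a dictionary with "unique_ids" and "properties"), the local_store mapping, and the "unresolved" bucket are assumptions made for the example. Deletions are deliberately not inferred, for the reason given above.

```python
from typing import Any, Dict, List, Optional

MY_SOURCE = "https://fmis.example.com"  # hypothetical source of our application


def own_id(element: Dict[str, Any], my_source: str) -> Optional[str]:
    """Return the identifier our application attached to the element, if any."""
    for source, value in element["unique_ids"]:
        if source == my_source:
            return value
    return None


def classify(incoming: List[Dict[str, Any]], local_store: Dict[str, Dict[str, Any]],
             my_source: str) -> Dict[str, List[Dict[str, Any]]]:
    """Sort incoming elements into new, modified, unchanged and unresolved."""
    result = {"new": [], "modified": [], "unchanged": [], "unresolved": []}
    for element in incoming:
        key = own_id(element, my_source)
        if key is None:
            result["new"].append(element)          # no UniqueId of ours: treat as new
        elif key not in local_store:
            result["unresolved"].append(element)   # carries our id, but we hold no such record
        elif element["properties"] != local_store[key]:
            result["modified"].append(element)     # property-by-property comparison
        else:
            result["unchanged"].append(element)
    return result


local_store = {"field-0042": {"description": "North 40", "area_ha": 16.2}}
incoming = [
    {"unique_ids": [(MY_SOURCE, "field-0042"), ("https://retailer.example.org", "R-9")],
     "properties": {"description": "North Forty", "area_ha": 16.2}},
    {"unique_ids": [("https://retailer.example.org", "R-10")],
     "properties": {"description": "Creek Bottom", "area_ha": 8.0}},
]
changes = classify(incoming, local_store, MY_SOURCE)
```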