Updating 2 tables related by 3 ID fields

I'm having a little trouble setting up this script. Here are the bullet points.

Two tables share the same three fields: [SiteID], [EasyID], [FeatureID]. For simplicity, let's pretend these are the only fields (besides OID).

[FeatureID] is a unique integer for each feature created in Table 1.

[SiteID] is a non-unique integer and groups features by location.

[EasyID] is a string (typically a number) unique within the same [SiteID], but not unique across the field as a whole.

Table 1 - The master table. New features have all three ID fields populated.

Table 2 - New features always have [SiteID], sometimes [EasyID], and never a [FeatureID].

Table 2 - Features with [SiteID] and [EasyID] but no [FeatureID] may have a matching [SiteID] and [EasyID] in Table 1. If so, update [FeatureID] in Table 2.

Table 2 - Features missing [EasyID] need to be assigned the next unused number for that [SiteID] across either Table 1 or Table 2. If the current EasyIDs for a site are '1', '3', '4A', '6-7', 'S', and '55', new features would be assigned '2', '4', '5', '6', '7', '8', etc.
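That fill-in rule can be sketched in plain Python (the function name is mine, not from the thread; the sketch assumes EasyIDs that fail int() conversion are simply skipped):

```python
def next_easy_ids(existing, count):
    """Return the next `count` unused integer EasyIDs (as strings),
    ignoring values like '4A', '6-7', or 'S' that are not plain integers."""
    used = set()
    for e in existing:
        try:
            used.add(int(e))
        except ValueError:
            pass  # non-integer EasyIDs do not block any number
    result = []
    n = 1
    while len(result) < count:
        if n not in used:
            result.append(str(n))
        n += 1
    return result

print(next_easy_ids(['1', '3', '4A', '6-7', 'S', '55'], 6))
# → ['2', '4', '5', '6', '7', '8']
```

This reproduces the example above: '1', '3', and '55' are reserved, everything else is skipped, so the next six assignments are '2', '4', '5', '6', '7', '8'.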

Table 2 - New features are given a [FeatureID] when inserted into Table 1. The features are then updated in Table 2.

Table 1 - Finally, any features with a [SiteID] in Table 2, but whose [FeatureID] and [EasyID] are not in Table 2, are inserted into Table 2.

At the end of the script both tables should match.

I've looked at using ModelBuilder with no luck. With Python dictionaries and search cursors I'm getting stuck trying to join against [SiteID] and [EasyID] at the same time. I also don't know how to build the dictionaries with just the integer EasyIDs and loop through them, assigning the next smallest unused integer.

import arcpy

# Tables
T1 = r"C:\Python\Scratch.gdb\Table1"
T2 = r"C:\Python\Scratch.gdb\Table2"
fields = ["FeatureID", "SiteID", "EasyID"]

# FeatureID-keyed dictionaries for each table
T1Dict = {r[0]: r[0:] for r in arcpy.da.SearchCursor(T1, fields)}
T2Dict = {r[0]: r[0:] for r in arcpy.da.SearchCursor(T2, fields)}

# SiteID+EasyID composite-key dictionaries for each table
T1ConcatDict = {str(r[1]) + "," + str(r[2]): r[0] for r in arcpy.da.SearchCursor(T1, fields)}
T2ConcatDict = {str(r[1]) + "," + str(r[2]): r[0] for r in arcpy.da.SearchCursor(T2, fields)}

# First: if T2.FeatureID is Null but T2.SiteID and T2.EasyID match a row in T1, update T2.FeatureID
with arcpy.da.UpdateCursor(T2, fields) as updateRows:
    for updateRow in updateRows:
        # Build the join key by combining the SiteID and EasyID values of the row being updated
        keyValue = str(updateRow[1]) + "," + str(updateRow[2])
        # Verify that the key is in the dictionary and the row still needs a FeatureID
        if keyValue in T1ConcatDict and updateRow[0] is None and updateRow[1] is not None:
            # Transfer the FeatureID stored under the key to the row being updated
            updateRow[0] = T1ConcatDict[keyValue]
            updateRows.updateRow(updateRow)

# Rebuild the T2 dictionaries if they are needed again
T2ConcatDict = {str(r[1]) + "," + str(r[2]): r[0] for r in arcpy.da.SearchCursor(T2, fields)}
T2Dict = {r[0]: r[0:] for r in arcpy.da.SearchCursor(T2, fields)}

'''
# Get max(EasyID) within SiteID -- incomplete attempt
NumberList = []
for value in T1Dict[2]:
    try:
        NumberList.append(int(value))
    except ValueError:
        continue
T1EasyNumberDict = [s[2] for s in T1Dict[2] if s.isdigit()]
T1MaxEasyDict = max(T1EasyNumberDict)
'''

# Second: if T2.FeatureID and T2.EasyID are Null, update T2.EasyID with the next
# unused number (as a string) in either T1 or T2 for the specific SiteID
with arcpy.da.UpdateCursor(T2, fields) as updateRows:
    for updateRow in updateRows:
        # Build the join key from SiteID and EasyID
        keyValue = str(updateRow[1]) + "," + str(updateRow[2])
        # Verify the key is in the dictionary and both FeatureID and EasyID are Null
        if keyValue in T1ConcatDict and updateRow[0] is None and updateRow[2] is None:
            # Perhaps retrieving the max/next unused integer occurs here?
            updateRow[2] = max(T1Dict[keyValue][2], T2Dict[keyValue][2])
            updateRows.updateRow(updateRow)

# Third: insert into T1 if T2.SiteID is not Null and T2.EasyID is not Null
# Fourth: update T2.FeatureID with T1.FeatureID from the previous insert
#         where T2.SiteID = T1.SiteID and T2.EasyID = T1.EasyID
# Lastly: insert any T1 features into T2 where T1.EasyID not in
#         (select EasyID from T2 where T2.SiteID = T1.SiteID)
#         and T1.FeatureID not in (select FeatureID from T2)
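As an aside, one way around getting stuck joining on [SiteID] and [EasyID] at the same time is to key the dictionary on a tuple instead of a concatenated string. A plain-Python sketch with made-up sample rows:

```python
# Sample (FeatureID, SiteID, EasyID) rows -- made-up data for illustration only
rows = [(101, 1, '1'), (102, 1, '3'), (103, 2, '1')]

# A tuple key joins on both fields at once and avoids string-concatenation
# ambiguity (e.g. SiteID 1 + EasyID '11' vs SiteID 11 + EasyID '1')
lookup = {(site, easy): fid for fid, site, easy in rows}

print(lookup[(1, '3')])  # → 102
print((2, '3') in lookup)  # → False
```

The same comprehension works directly over an arcpy SearchCursor row; only the key construction changes.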


What distinguishes the skipped records from the records that step 6 inserted into the T2 table? I see no way to keep track of what is new in T1 since the last time the script was run in order to make that choice. All of the T2 insertions and the skipped records have no SiteID, so that is not a difference that can drive the choice. I don't agree that you have the steps in the correct order from what I can see. Your logic appears backwards, since normally I would deal with Nulls and new-versus-old records as my first steps in any comparison script, not towards the end. My scripts deal with new versus old by renaming existing data and deriving current data from another source, so that the comparison is easy to make. Possibly a variation of that approach would apply here, so that a last-run version of the data is created and you can be sure you know what is really new and what you have processed before.

I would also add the ObjectID field for both tables to your field list and reorder the fields as:

["SiteID", "EasyID", "FeatureID", "OID@"]

The ObjectID would be used in subroutines to impose order on Null values and to validate your assumptions of unique keys.
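A minimal illustration of using the ObjectID to impose a deterministic order when the key contains Nulls (the field pairing and sample data here are mine, for illustration only):

```python
# (EasyID, OID) pairs -- made-up sample data; EasyID may be Null (None)
rows = [('2', 3), (None, 5), (None, 1)]

# Sort non-Null keys first, then break ties (including among the Nulls)
# by ObjectID so repeated runs always process rows in the same order
ordered = sorted(rows, key=lambda r: (r[0] is None, r[0] or '', r[1]))

print(ordered)  # → [('2', 3), (None, 1), (None, 5)]
```

Without the OID tiebreaker, the relative order of the Null-key rows would depend on cursor order, which is not guaranteed.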

Anyway, you never said whether errors are occurring or just unexpected values are being assigned. Unexpected values indicate a logic failure, while errors indicate a syntax or data validation failure. I am almost certain you will experience many logic errors developing and testing the script, since there are so many dependencies at each stage that have to be considered, so develop only on test data and back up your data before trying it out on your live data.

My blog avoided going into anything this complex, because I wanted to keep the core principles and the approach I was demonstrating easy to follow. You may want to look at this post to see an example of where I adapted the approach to deal with a much more complex many-to-many relationship between tables, for further ideas about ways to vary the basic approach outlined in the blog.

This script will run at night. The T1 features that don't have a SiteID in T2 are basically ignored until the T2 user adds the first feature with that SiteID. The user may have manually matched the SiteID and EasyID before the script runs. Additionally, T2 users adding the first occurrences of features in either table will skip adding an EasyID and rely on unique ones being generated.

One other note is once a feature has all 3 ID'S it will never be changed by the users.

Perhaps if I pre-filter T1 to the SiteIDs present in T2, the script would be simpler. I believe you have identified step 1 as redundant, since it occurs later and max(EasyID) is taken across both tables.

In what way are you getting stuck trying to join against [SiteID] and [EasyID] at the same time? Are you getting errors? The basic approach to a combined key will work as shown for the first 25 lines of code (I can't follow your overall logic beyond that).

Null values in the key most likely cause most of the problems, so I would restructure the code order to deal with the second part of your script first: fill in Null values in the EasyID field before worrying about the FeatureID field at all. That involves processing a list of EasyIDs in the single-key dictionary of just SiteID key values first, both to verify the unique-value assumption for the non-Null EasyID values and to fill in the Null EasyID values. Don't build dictionaries for T2 at all until they can be used. In any case, the cursor-and-dictionary approach is your best option and can handle this whole set of processes, but each step must happen in the correct order to avoid faulty assumptions about which set of fields contains unique values at each stage of the script.

Clearly you are dealing with a highly complex interrelationship between these two tables and a large set of rules that I have yet to understand. I have no context for how these records and values came into existence or what uses they will serve in the future. More crucially, you have given me no information about the interrelationship this script has with user actions. Every step you expect a user to do or not do creates a point of failure for your script and any of your rules and assumptions. If the user has to manually set off the script, you must always start by verifying that they did everything you expected them to do, and didn't do anything you didn't expect, relative to your script's assumptions.

Also, Xander is correct that mentioning my full name in a post puts a message in my inbox, which is the reason I saw this post when he did that.

I haven't gotten far enough to get any errors. I don't know how to build/filter a dictionary of T2.OID, T2.SiteID, and the string EasyID, then gather only the integer EasyIDs and return the first missing integer starting from 1. Without that I can't test updating Null EasyIDs in new T2 features.

Thank you for posting that link to Stack Exchange! It looks very similar to the components of my scenario. I will update after testing the code.

Your approach still mystifies me and I still don't understand your business rules. Your rules may make sense to you and may be correct for your business needs, but on the surface they at least partially conflict with my experience in synchronizing data and matching tables. The picture in your head of how everything should work is not transferring into mine yet.

I highly recommend that you reconsider the rule that says:

Table 2 - Features missing [EasyID] need to be assigned the next number for that [SiteID] in either Table 1 or 2. If current EasyID's for a site are '1', '3', '4A', '6-7','S', and '55', new features would be '2', '4', '5', '6', '7', '8'....etc.

The EasyID is anything but easy to understand or program as you have described it. Filling in these blanks this way seems arbitrary to me, especially given that the natural sort of the strings is actually '1', '3', '4A', '55', '6-7', 'S'. Why fill in blanks at all? Over time that means any deleted records will have their SiteID + EasyID combination reused for an entirely unrelated record, and therefore that key is only unique within the snapshot in time before the script reuses it. In other words, you will never be able to use the SiteID + EasyID key if you ever have to compare two different data snapshots that were taken before and after the script ran. This rule may make sense to you, but in my experience this is a bad database practice. Unique keys (single- or multi-field) are only valuable in my experience if they are unique to one record over all time or support actual data relationships, and they become a problem if they are ever reused for completely unrelated records. I personally don't want to help implement this rule, since it seems excessively complicated to me, and I believe from experience that a day will come when you will want to use that key to recover from a data corruption event, and the code that implements this rule will make that recovery nearly impossible. You also greatly increase the likelihood of creating data corruption if you accidentally link together two snapshots that reassigned the same keys to different records.
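The natural-sort point is easy to verify in plain Python:

```python
easy_ids = ['1', '3', '4A', '6-7', 'S', '55']

# Lexicographic string sort, not numeric: '55' sorts before '6-7'
# because '5' < '6' character by character
print(sorted(easy_ids))  # → ['1', '3', '4A', '55', '6-7', 'S']
```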

I have several other questions about this EasyID field. How many characters are allowed in this field? Why does it contain letters and what is the significance of those letters? Why are there dashes to combine two numbers? Since this field is a string field, how do your users handle the fact that it will never sort numerically in any table, since you don't include leading spaces or strings to right-justify them? What type of business are you working for where this business process was developed to track any of this data in either table?
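For what it's worth, the string-field sort problem those questions point at can be worked around with a right-justified sort key. A sketch, assuming a 7-character field width:

```python
ids = ['1', '3', '55', '6']

# Plain string sort puts '55' before '6'
print(sorted(ids))  # → ['1', '3', '55', '6']

# Zero-padding to the field width restores numeric order for integer strings
print(sorted(ids, key=lambda s: s.zfill(7)))  # → ['1', '3', '6', '55']
```

This only helps for pure-integer EasyIDs; mixed values like '4A' or '6-7' would still need a convention of their own.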

So, from what little I do understand (or think I understand), I will try to present some code that should fit your needs. This code is more or less what I would start with. Key fields should always come first in the field list, and value fields should follow. I would incorporate the OID field into the code processes and dictionaries as a fail-safe unique key for linking back to the original table wherever the user-defined keys turn out to be duplicated and not unique.

The code below handles both a 1:1 and 1:M relationship possibility, so even if the key value is not unique you will be able to trap that and fix it.

import arcpy
import sys

# Tables
T1 = r"C:\Python\Scratch.gdb\Table1"
T2 = r"C:\Python\Scratch.gdb\Table2"
fields = ["SiteID", "EasyID", "FeatureID", "OID@"]

# Initialize T1 as a dictionary
T1Dict = {}
# Initialize a list to hold any concatenated-key duplicates found
T1KeyDups = []
# Open a search cursor and iterate rows
with arcpy.da.SearchCursor(T1, fields) as searchRows:
    for searchRow in searchRows:
        # Build a composite key value from 2 fields
        keyValue = '{};{}'.format(searchRow[0], searchRow[1])
        if keyValue not in T1Dict:
            # Key not in dictionary. Add key pointing to a list of a list of field values
            T1Dict[keyValue] = [list(searchRow[2:])]
        else:
            # Key in dictionary is not unique.
            T1KeyDups.append(keyValue)
            # Append a list of field values to the list the key points to
            T1Dict[keyValue].append(list(searchRow[2:]))
del searchRows, searchRow

# Sample of how to access the keys, record count, and record values of the dictionary
for keyValue in T1Dict.keys():
    for i in range(0, len(T1Dict[keyValue])):
        print("The SiteID;EasyID key is {} with {} record(s). Record {} has FeatureID {} and ObjectID {}.".format(keyValue, len(T1Dict[keyValue]), i + 1, T1Dict[keyValue][i][0], T1Dict[keyValue][i][1]))

if len(T1KeyDups) > 0:
    # Duplicate keys exist in T1.
    # Give a warning and either exit the script or else fix T1 before proceeding
    print("Duplicate keys found! They are:")
    for keyValue in T1KeyDups:
        for i in range(0, len(T1Dict[keyValue])):
            print("The SiteID;EasyID key is {} with {} record(s). Record {} has FeatureID {} and ObjectID {}.".format(keyValue, len(T1Dict[keyValue]), i + 1, T1Dict[keyValue][i][0], T1Dict[keyValue][i][1]))
    # Either exit or fix T1 here
    sys.exit(-1)

If you are determined to implement the rule as you have described it, here is some code that should build a SiteID dictionary for T2, find the sorted list of integers not yet used for each SiteID, and assign the next unused number to the Null-EasyID records associated with each SiteID. Also, do not bother filtering cursors if you intend to put everything into a dictionary. It is faster to put everything into the dictionary and then do all of the logic tests, type validations, and list tracking in code. You might filter for the Null EasyID values prior to running the UpdateCursor, but even if you have 1 million records to process, the update cursor will only take about 5 minutes to run through all of them (and the SQL that filters for Null values might take longer, since Null-value queries run pretty slowly, especially if EasyID is not indexed).

import arcpy
import sys

# Tables
T1 = r"C:\Python\Scratch.gdb\Table1"
T2 = r"C:\Python\Scratch.gdb\Table2"
fields = ["SiteID", "EasyID", "FeatureID", "OID@"]

# Get the list of EasyIDs associated with each SiteID in a dictionary for T1
T1SiteIDDict = {}
with arcpy.da.SearchCursor(T1, fields) as searchRows:
    for searchRow in searchRows:
        keyValue = searchRow[0]
        if keyValue not in T1SiteIDDict:
            # Key not in dictionary. Add the key pointing to a list of EasyID values
            T1SiteIDDict[keyValue] = [searchRow[1]]
        else:
            # Append the EasyID value to the list the key points to
            T1SiteIDDict[keyValue].append(searchRow[1])
del searchRows, searchRow

# Get the list of EasyIDs associated with each SiteID in a dictionary for T2
T2SiteIDDict = {}
with arcpy.da.SearchCursor(T2, fields) as searchRows:
    for searchRow in searchRows:
        keyValue = searchRow[0]
        if keyValue not in T2SiteIDDict:
            # Key not in dictionary. Add the key pointing to a list of EasyID values
            T2SiteIDDict[keyValue] = [searchRow[1]]
        else:
            # Append the EasyID value to the list the key points to
            T2SiteIDDict[keyValue].append(searchRow[1])
del searchRows, searchRow

SiteIDDict = {}
for keyValue in T2SiteIDDict.keys():
    intList = []
    for easyID in T2SiteIDDict[keyValue]:
        # Guard against Null and keep only plain integer strings
        if easyID is not None and easyID.isdigit():
            intList.append(int(easyID))
    if keyValue in T1SiteIDDict:
        for easyID in T1SiteIDDict[keyValue]:
            if easyID is not None and easyID.isdigit():
                intList.append(int(easyID))
    # Remove already-used numbers from the numbers 1 to 9999
    # and store a sorted list of unused numbers for each SiteID
    SiteIDDict[keyValue] = sorted(set(range(1, 10000)) - set(intList))

with arcpy.da.UpdateCursor(T2, fields) as updateRows:
    for updateRow in updateRows:
        if updateRow[1] is None:
            templist = SiteIDDict[updateRow[0]]
            # EasyID is a string field, so convert the number before assigning it
            updateRow[1] = str(templist[0])
            # Removing the number from templist also updates the list in the dictionary
            templist.remove(int(updateRow[1]))
            updateRows.updateRow(updateRow)

Wow, thank you for these excellent resources! I will try to digest this over the weekend, but this looks very promising. Sorry that I didn't provide all of the aspects of this dilemma. I tried to simplify it enough that Python masters like yourself wouldn't have to read a novel. As you ascertained, this is complex.

The business table (T1) sits on a 2005 SQL Server, unsupported since ArcGIS 10.2. EasyID is a 7-character string used for labeling at a site, and it is the syncing bane of my existence. Users are allowed to assign "C5-C8" so that one FeatureID represents numerous real-world things. C5, C6, C7, and C8 could also exist in T1 individually. It is a data quality mess. To combat this, the groupings will not make it to T2. Instead, C5, C6, C7, and/or C8 will be digitized by the T2 user, if so desired.

Luckily, users are unable to delete rows and can't view or change the FeatureID or SiteID. The other table (T2) and a copy of the Sites table are layers in a hosted service for collecting new site features. I update the Sites with a scheduled script and thanks to Collector for ArcGIS 10.3 honoring relationships, all features maintain their SiteID.

In short, users are limited to picking an EasyID during feature creation in either table, and they could possibly update EasyID on T2. On the off chance they do change EasyID on a feature existing in both T1 and T2, there will be a script that joins by FeatureID and updates the one with the earlier edit date.

I could go on and on, but don't want to spoil your weekend. Thanks again Richard Fairhurst!

I've made some headway. My script is successfully inserting new rows into SQL Server with pyodbc. However, I've got an encoding issue when updating the hosted layer. The SQL table columns are varchar, and I assume the hosted layer's are nvarchar. The call to update the hosted layer succeeds, but the text column doesn't actually update. Integer fields do.

I am not clear which part of the code is causing the problem, or which field name is for an integer and which is for a string. Generally, to use Python to write to a string field you have to enclose non-strings in the str() function. For converting strings to integers you must always test that the string is actually an integer, and use Null (or some other default value for blank numbers) when the string is not numeric. For example, " " is not an integer and will fail if you try to convert it. A Null will potentially also cause a problem.
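A hedged sketch of that kind of defensive conversion (the function name is mine, not from the thread):

```python
def to_int_or_none(value):
    """Return int(value) when the value is a clean integer string; otherwise None.

    None stands in for the database Null / default value described above."""
    try:
        return int(str(value).strip())
    except ValueError:
        return None

print(to_int_or_none('42'))   # → 42
print(to_int_or_none(' '))    # → None  (blank is not an integer)
print(to_int_or_none('4A'))   # → None
print(to_int_or_none(None))   # → None  (Null input handled too)
```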

Anyway, any time type conversion is involved in a script, it is easy to produce errors and unexpected results until you include enough logic to handle both the values that can convert and the values that can't at each conversion step. To confirm for yourself what is failing, you should include a print statement in your try/except block to show you the actual value read from the source field that you are trying to write to the field that is rejecting the value or producing bad results.

Actually, line 230 is a print statement. Are you sure it is executing? You should break that whole expression down into multiple print statements so you are certain exactly what arguments are being passed to the query and calculation. For example, you could try:
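A minimal decomposition along those lines (the variable names follow the thread's example and are assumptions; the sample row is made up, and the layer call itself is left out):

```python
# Made-up sample row standing in for a cursor row: (CustomerID, ..., ..., OBJECTID)
updateRow = [7, 'x', 'y', 1234]

# Print each argument separately, then the assembled query,
# before executing anything against the layer
print(updateRow[3])
print(updateRow[0])
where = "OBJECTID={} AND CustomerID={}".format(updateRow[3], updateRow[0])
print(where)  # → OBJECTID=1234 AND CustomerID=7
```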

Then, if all of those inputs make sense, execute the fl.calculate outside of a print operation.

There are some details to pay attention to. If CustomerID is numeric and the value makes sense, then the SQL should be fine. But if CustomerID is a string value, then the SQL will fail, since the value is not within quotes. For a string CustomerID you would need: "OBJECTID={} AND CustomerID='{}'".format(updateRow[3], updateRow[0])
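That quoting rule can be wrapped in a small helper (a sketch; the function name is mine):

```python
def customer_where(oid, customer_id):
    """Build the where clause, quoting CustomerID only when it is a string."""
    if isinstance(customer_id, str):
        return "OBJECTID={} AND CustomerID='{}'".format(oid, customer_id)
    return "OBJECTID={} AND CustomerID={}".format(oid, customer_id)

print(customer_where(1234, 7))     # → OBJECTID=1234 AND CustomerID=7
print(customer_where(1234, 'A7'))  # → OBJECTID=1234 AND CustomerID='A7'
```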

I figured it out. I was trying to update two different fields at once, and the field calculator understandably only takes one. It's crazy how much time I spent looking up information on Unicode issues only to find it was a misuse of the parameter.

Can't thank you enough, Richard. I should be able to fumble through the rest of the conditionals.